Recent advances in genome-wide association studies (GWAS) have led both to an increase in the size of available genetic data and to the identification of important genetic variants responsible for a variety of diseases. Prediction of these genetic diseases has also become of paramount importance. However, prediction with big data such as GWAS is not trivial. A key obstacle in big data prediction is identifying (perhaps a small number of) variable sets that lead to good prediction when variable dimensionality can be extremely large. The project explores why the common approach of selecting variables by statistical significance can often fail to deliver strong prediction rates. A novel, interaction-based and prediction-oriented approach to extracting hidden information contained in big data will be investigated. To improve prediction, a new criterion to guide the selection of variable sets will be developed.
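To make the gap between significance and predictivity concrete, the following sketch simulates a single variant with a small effect in a large sample; every parameter here is hypothetical and chosen only for illustration, not drawn from the project. The association test is highly significant, yet the variant's best achievable classification accuracy is no better than always guessing the majority class.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000                       # GWAS-scale sample size (hypothetical)
geno = rng.binomial(2, 0.3, n)    # additive genotype coding 0/1/2, MAF 0.3

# A small true effect on the log-odds of disease.
p_case = 1.0 / (1.0 + np.exp(-(-0.5 + 0.08 * geno)))
y = rng.binomial(1, p_case)

# Significance: chi-square test of genotype against case/control status.
table = np.array([[np.sum((geno == g) & (y == c)) for c in (0, 1)]
                  for g in (0, 1, 2)])
_, pval, _, _ = stats.chi2_contingency(table)

# Predictivity: the best single-variant classifier predicts the majority
# class within each genotype group; compare with the no-information baseline.
acc = sum(row.max() for row in table) / n
base = max(y.mean(), 1 - y.mean())
print(f"p = {pval:.1e}, accuracy = {acc:.3f}, baseline = {base:.3f}")
```

With these settings the printed accuracy coincides with the baseline: the variant is genuinely associated with the outcome, yet on its own it contributes essentially nothing to prediction.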
Prioritizing predictivity, rather than significance, requires correct estimates of prediction rates and predictivity-based criteria for evaluating variable sets. The project offers a novel theoretical framework that characterizes what makes variable sets highly predictive and lays the groundwork for a new criterion to identify such sets. In this framework, variable sets have theoretical (true) levels of predictivity, which can be estimated with appropriately designed sample-based measures. The framework is the first to seek estimators tailored to a criterion of predictivity. Additionally, methods that encompass both marginal and joint effects will be investigated, and a candidate measure of predictivity will be studied. Four real data examples will be analyzed to illustrate how final predictors found via the new approach compare with those obtained by other approaches in the current literature.
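Because this summary does not specify the candidate measure, the sketch below assumes, purely for illustration, a partition-based score in the spirit of the influence score (I-score) from the partition retention literature: the sample is partitioned into cells by the joint levels of a variable set, and cells whose mean response deviates from the overall mean contribute to the score. The function name partition_score and the toy data are hypothetical.

```python
import numpy as np

def partition_score(X_subset, y):
    """Partition-based predictivity measure for a discrete variable set
    (an illustrative stand-in for the project's candidate measure).

    X_subset : (n, k) array of discrete variables, e.g. genotypes 0/1/2.
    y        : (n,) response (binary or continuous).
    """
    n = len(y)
    y_bar = y.mean()
    s2 = np.mean((y - y_bar) ** 2)   # sample variance of the response
    # Cells of the partition = distinct joint levels of the k variables.
    _, cell = np.unique(X_subset, axis=0, return_inverse=True)
    cell = cell.ravel()
    score = 0.0
    for j in np.unique(cell):
        in_cell = cell == j
        n_j = in_cell.sum()
        score += n_j**2 * (y[in_cell].mean() - y_bar) ** 2
    return score / (n * s2)

# Toy check: the response is a pure interaction of X0 and X1 (no marginal
# effects), so the joint set {X0, X1} scores high while a noise variable does not.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(5000, 3))
y = (X[:, 0] ^ X[:, 1]).astype(float)
print(partition_score(X[:, :2], y))   # large: joint signal captured
print(partition_score(X[:, [2]], y))  # near zero: no signal
```

A score of this form rewards cells whose response distribution departs from the overall mean, so it can register joint effects that one-variable-at-a-time significance tests miss, consistent with the emphasis above on methods that encompass both marginal and joint effects.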