The use of complex sampling, e.g., stratified multistage cluster sampling, in population-based case-control studies is becoming more common. In addition to being cost- and time-effective, complex sample designs can yield representative samples of the study population and avoid biased selection of controls and/or cases. To realize the full potential of these designs, however, analyses must address at least two complexities introduced by complex sampling: 1) differential selection probabilities and 2) intra-cluster correlations. As a result of these complexities, the sample distribution can differ from the distribution of the underlying population from which the sample is selected. In this project, we will develop statistical methods that account for both complexities when estimating the effects of genes, environmental factors, and their interactions on the risk of complex diseases. Specifically, motivated by the efficiency advantage of the retrospective method, we will explore the assumptions of Hardy-Weinberg equilibrium (HWE) and gene-environment (GE) independence and develop an efficient estimator suitable for case-control studies with complex sample designs. In practice, many case-control studies use frequency-matched designs, in which controls are selected in numbers proportional to the number of cases within matching strata during the complex sampling. We will further incorporate the frequency-matching design into our proposed estimators. The proposed methods will be evaluated using simulations and data from two population-based case-control studies with complex sample designs. A unified software package will also be developed.
This project proposes innovative statistical methods for analyzing data from population-based case-control studies in which controls are sampled with a complex sample design. The results will contribute to the understanding of the interplay between genetic susceptibility and environmental risk factors and will provide an important resource for designing future population-based case-control studies.