Many subpopulations of special interest to public health, such as sex workers, are hard to survey, either because they are rare and a large number of screening interviews would be needed to generate a sufficient sample size, or because they are stigmatized and unlikely to trust researchers with personal information. Respondent-driven sampling (RDS) is one of the most effective means of sampling such subpopulations. It asks and incentivizes subpopulation members to recruit other members through their personal social networks, then weights the resulting sample to correct for biases induced by the sampling design and to support inferences about univariate statistics that are, under certain conditions, generalizable to the subpopulation of interest. Hundreds of studies have been conducted using RDS, backed by over $166 million of federal funding. The basic methodology of RDS has been subjected to several extensions, evaluations, and criticisms, but prior statistical developments have largely focused on improving estimators for univariate statistics (e.g., prevalence of a risk factor). We propose to extend prior methodological work on statistical estimation in RDS to develop accurate and efficient tools that will allow researchers to estimate the parameters of multivariate regression models, which will enhance understanding of hard-to-survey subpopulations. The current practice of multivariate RDS estimation is ad hoc: researchers apply over 10 distinct approaches throughout the literature but offer little or no justification for the approach they chose. RDS methodologists have yet to establish best practices or evaluate the performance of these different approaches. We propose to perform this evaluation. By doing so, this project will enable future RDS studies to address multivariate research questions about hard-to-survey subpopulations, and it will add substantial value to the data from the hundreds of RDS studies that have already been funded and conducted.
The proposed project has two components that will give researchers and the public health community both guidance on conducting multivariate analyses with RDS data and the tools to conduct these analyses. The first component consists of a series of simulation studies that evaluate the performance of the most popular multivariate RDS estimators. The simulation studies will be designed to explore the performance of the estimators across a range of theoretically ideal and more realistic RDS sampling scenarios, as well as a diversity of network types. The second component involves the development and dissemination of software for two commonly used statistical packages (R and Stata) that implements the best-performing multivariate estimators identified in the simulation studies. The data collected in RDS studies have vast untapped potential to contribute to the understanding of specific risk factors in hard-to-survey populations, and the multivariate tools we will develop as part of this proposal will help to unlock this potential.
This project will evaluate existing methodologies for estimating the parameters of multivariate regression models on samples collected with respondent-driven sampling (RDS). RDS has emerged as one of the premier data collection approaches for hard-to-survey subpopulations, such as those at high risk of contracting HIV. While RDS has seen substantial methodological development around the estimation of disease prevalence and other univariate statistics, the estimation of parameters from multivariate regression models with RDS data has so far escaped attention. Despite hundreds of RDS samples having been collected using more than $166 million in federal funding, there is deep uncertainty in the literature about the best approaches to estimating parameters from multivariate models with these data. Indeed, the literature offers no clear guidelines; we found more than 10 separate approaches to estimating parameters from these models, with little consideration of which work best. This is unfortunate because many of the classic methods of data analysis in the social and public health sciences rely on multivariate regression models to understand risk factors, rule out alternative explanations, and test the robustness of results. We will use a simulation evaluation framework to analyze the properties of the most popular approaches to RDS regression estimation found in the literature, and we will develop and offer clear guidelines, linked to shared statistical software, that other researchers can apply to obtain unbiased and efficient parameter estimates for multivariate models with RDS data.