Multiple imputation (MI) methods have been widely used in many scientific fields to address missing data. Several statistical software packages implement MI procedures, but their performance varies. Attempts have been made to compare the performance of MI procedures across statistical packages. Because of the complexity of MI, these comparisons were made under specific settings, such as assuming a monotone missing pattern, a semi-continuous missing variable, or simple artificial data. None of these comparisons addressed which methods are best suited to a large data set, such as the SEER registry data, with a non-monotone missing pattern and many variables. This study will investigate issues specific to the SEER registry data when MI methods are used to handle missing data. Through this study, guidance will be provided to researchers in the cancer registry community on properly handling missing data in cancer registry data.
Missing data are a frequent problem in most scientific studies and a common feature of large data sets in general and medical data sets in particular. They can cause bias or lead to inefficient analyses if not handled properly. Because of SEER's high standards, the SEER data have only a small fraction of missing data for most of the variables collected. However, one very important variable, the SEER Summary Stage (1977, 2000 and CS), contains a higher percentage of missing or unknown data, especially for certain cancer sites. For example, 9.8% of the lung cancer cases and 22% of the liver cancer cases were coded as unknown for the variable SEER Summary Stage 2000 in the 2001-2003 SEER data. Among researchers in the cancer registry community, the complete case method (listwise deletion) is the most commonly used method to address this missing data issue in data analysis.
If data are not missing completely at random, using the complete case method can introduce bias and produce incorrect results. In the study data mentioned above, the distributions of age at cancer diagnosis differed significantly between cases with known stage and cases with unknown stage: 34% of cases with known stage were 75 years old or older, compared with 54% of cases with unknown stage. This strongly suggests that cases with unknown stage were not missing completely at random. Hence, the complete case method is not an ideal method for analyzing these data. Coding the cases with unknown stage as a separate sub-category does include all cases in the analysis, but unfortunately, severe bias has been found for this type of analysis when data are not missing completely at random. Compared with the complete case method, MI, one of the more sophisticated methods for handling missing data, provides superior estimates when data are missing at random. First proposed in 1978, MI has become an important and influential approach to the statistical analysis of missing data in recent years because it is easy to use and readily available in many statistical packages. MI replaces each missing value with a set of plausible values that represent the uncertainty about the most appropriate value to impute, then combines the results of the separate analyses of each completed data set to generate the final estimates. It has been suggested that MI often provides valid and robust inferences even when its assumptions are not met.
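The combining step described above is commonly carried out with Rubin's rules. The following is a minimal sketch of that pooling step only, not the procedure of any particular statistical package and not necessarily the one this study will evaluate. The function name `pool_rubin` and the numbers in the usage example are hypothetical and chosen purely for illustration.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool results from m completed data sets using Rubin's rules.

    estimates: point estimate of the same quantity from each completed data set
    variances: squared standard error of that estimate in each completed data set
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = mean(estimates)    # pooled point estimate
    w_bar = mean(variances)    # within-imputation variance
    b = variance(estimates)    # between-imputation variance (divides by m - 1)
    total = w_bar + (1 + 1 / m) * b
    return q_bar, total


if __name__ == "__main__":
    # Hypothetical estimates from five imputed data sets (illustrative only).
    est = [0.50, 0.54, 0.46, 0.52, 0.48]
    var = [0.010] * 5
    q, t = pool_rubin(est, var)
    print(f"pooled estimate = {q:.3f}, total variance = {t:.4f}")
```

The between-imputation term is what carries the uncertainty about the imputed values into the final standard error; analyzing a single imputed data set as if it were fully observed would omit it.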

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research and Development Contracts (N01)
University of Kentucky
United States