U.S. breast cancer survivors number 2.5 million, more than the survivors of any other cancer. Studies on how to improve survival and quality of life in this ever-growing population are critical in reducing the national cancer burden. The ability to identify second breast cancer events (i.e., breast cancer recurrences and second primary breast cancers) is critical for cancer survivorship research. In response to the National Cancer Institute's call for studies of cancer surveillance using health claims data, we propose to develop and validate algorithms to identify second breast cancer events from automated healthcare utilization data in order to minimize the need for expensive and time-consuming manual medical record review. Automated healthcare utilization data are becoming increasingly accessible; however, these sources have yet to be validated against gold-standard medical record abstraction for obtaining information on second breast cancer events. This work is significant and necessary because state tumor registries do not routinely collect information on cancer recurrences. The proposed study will be conducted using data from two integrated healthcare delivery systems within the Cancer Research Network (CRN): Group Health Cooperative (in western Washington State) and the Henry Ford Health System (in Detroit, Michigan). These healthcare systems have extensive automated data on enrollment, diagnoses, procedures, and prescription medication fills. The proposed study is efficient because it will use gold-standard data on second breast cancer events that have already been abstracted for ~2,500 women as part of previously funded studies of breast cancer outcomes. The sample of women will be divided into a training dataset (60%) for algorithm development and a testing dataset (40%) for validation.
The primary aim of this study is to develop a "menu" of algorithms that researchers can select from under different circumstances; i.e., when they want to maximize sensitivity, specificity, or positive predictive value. Secondary analyses will explore: 1) whether algorithms developed in one population are valid in another, and 2) whether valid algorithms can be developed using more limited sources of data that are likely to be available in a larger number of healthcare settings. This project will use innovative approaches to develop the algorithm "menu" and to explore the generalizability of algorithm development.
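As a rough illustration of the design described above, the following Python sketch shows a 60/40 training/testing split and how sensitivity, specificity, and positive predictive value might be computed against gold-standard chart review, with one entry chosen from a "menu" of candidate algorithms according to the measure a researcher wants to maximize. It is not the study's actual method: the function names (`split_train_test`, `accuracy`, `pick_algorithm`), the fixed random seed, and the representation of each algorithm's output as a dictionary of patient ID to True/False are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Accuracy:
    sensitivity: float
    specificity: float
    ppv: float


def split_train_test(ids, train_fraction=0.6, seed=0):
    """Randomly split patient IDs into training and testing sets (hypothetical helper)."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def _ratio(numerator, denominator):
    # Undefined measures (empty denominator) are reported as NaN rather than guessed.
    return numerator / denominator if denominator else float("nan")


def accuracy(predicted, gold):
    """Compare algorithm flags with gold-standard record review.

    Both arguments map patient ID -> True if a second breast cancer event
    is present (gold) or flagged by the algorithm (predicted).
    """
    tp = fp = tn = fn = 0
    for patient_id, truth in gold.items():
        flagged = predicted[patient_id]
        if flagged and truth:
            tp += 1
        elif flagged and not truth:
            fp += 1
        elif not flagged and truth:
            fn += 1
        else:
            tn += 1
    return Accuracy(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
    )


def pick_algorithm(candidates, gold, measure):
    """Return the name of the candidate with the highest value of `measure`.

    `candidates` maps an algorithm name to its predictions; `measure` is
    'sensitivity', 'specificity', or 'ppv'. Assumes the measure is defined
    (not NaN) for every candidate.
    """
    return max(candidates, key=lambda name: getattr(accuracy(candidates[name], gold), measure))
```

In this sketch, a candidate would be chosen with `pick_algorithm` using only the training IDs, and its accuracy would then be reported from `accuracy` on the held-out testing IDs, so that the reported performance is not inflated by the selection step.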

Public Health Relevance

As the number of breast cancer survivors grows, research on breast cancer prognosis and quality of life is becoming increasingly important to public health; however, current methods for collecting data on breast cancer recurrences and second primary breast cancers are either time-consuming and costly or have not yet been validated. Being able to identify breast cancer outcomes from automated healthcare data is necessary for conducting large-scale, population-based studies to identify and modify factors that affect the prognosis and quality of life of women with breast cancer.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Exploratory/Developmental Grants (R21)
Study Section
Epidemiology of Cancer Study Section (EPIC)
Program Officer
Warren, Joan
Group Health Cooperative
United States
Chubak, Jessica; Onega, Tracy; Zhu, Weiwei et al. (2015) An Electronic Health Record-based Algorithm to Ascertain the Date of Second Breast Cancer Events. Med Care :
Chubak, Jessica; Pocobelli, Gaia; Weiss, Noel S (2012) Tradeoffs between accuracy measures for electronic health care data algorithms. J Clin Epidemiol 65:343-349.e2
Chubak, Jessica; Yu, Onchee; Pocobelli, Gaia et al. (2012) Administrative data algorithms to identify second breast cancer events following early-stage invasive breast cancer. J Natl Cancer Inst 104:931-40