The Surveillance, Epidemiology, and End Results (SEER) Program is a premier source of cancer statistics in the United States, and proper and efficient use of its resources is of public and national interest. We therefore propose innovative methods for estimating 5-year survival probability, identifying important predictors of survival, and estimating the effect of predictor variables on the survival time of cancer patients using the SEER data. In particular, we consider breast cancer survival data, as breast cancer is the most common type of cancer among women. Modeling survival time in terms of several disease characteristics and demographic factors is challenging because the data are censored and the model involves many parameters (a high-dimensional problem).
In Aim A, we consider an accelerated failure time (AFT) type model and propose a nonparametric Bayesian solution to this problem. The solution involves modeling the mean in terms of many parameters corresponding to the disease characteristics and demographic factors, and modeling the variance as a smooth nonparametric function of the mean. The nonparametric error distribution of the AFT model is handled via a constrained Dirichlet process prior. Because the mean involves a large number of parameters, a variable selection technique is adopted to reduce the effective dimension of the problem. The main innovation is treating the AFT model from this realistic and general perspective, which has not been done before. Many of the disease characteristics in the SEER database contain a significant proportion of missing values. Ignoring subjects with a missing value in any disease characteristic may distort the conclusions and would reduce the power to detect a potential association between survival time and the predictor variables.
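As a rough sketch of the Aim A model, the AFT-type structure described above could be written as follows. The notation (T_i, C_i, x_i, beta, sigma, F, G_0, alpha) is assumed here for illustration and is not fixed by this abstract. The specific constraint on the Dirichlet process is also not stated; a location and scale constraint for identifiability is one plausible reading.

```latex
% Illustrative notation only; symbols are assumptions, not taken from the proposal.
% T_i: survival time, C_i: censoring time, x_i: vector of predictors.
\begin{align*}
  \log T_i &= \mu_i + \sigma(\mu_i)\,\varepsilon_i,
    \qquad \mu_i = x_i^{\top}\beta, \\
  \varepsilon_i \mid F &\overset{\text{iid}}{\sim} F,
    \qquad F \sim \mathrm{DP}(\alpha, G_0)
    \ \text{subject to identifiability constraints on } F, \\
  Y_i &= \min(T_i, C_i),
    \qquad \delta_i = \mathbf{1}(T_i \le C_i).
\end{align*}
% sigma(.) is a smooth, positive, unspecified function of the mean;
% only (Y_i, delta_i, x_i) are observed because of right censoring.
```

In this form, variable selection acts on the components of beta, while sigma and F are left unspecified.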
In Aim B, we propose a semiparametric method for handling a missing predictor variable in the linear transformation model, a semiparametric model that contains the proportional hazards model and the proportional odds model as two special cases. The main innovation of this part is how we handle missing data and make inference about a finite-dimensional parameter in the presence of an infinite-dimensional parameter. Finally, our proposed methods permit a useful and accurate interpretation of the results of the analysis from a modern epidemiological perspective. Our models are broad, and we seek a distribution-free procedure to estimate the model parameters either in the presence of many predictors or in the presence of a missing predictor.
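For orientation, the linear transformation model referred to in Aim B is commonly written in the form below. The symbols (H, X, beta, epsilon) are assumed notation for illustration, and the sign convention varies across the literature.

```latex
% Illustrative notation only; sign conventions differ between authors.
% H: unspecified strictly increasing function (the infinite-dimensional parameter).
% beta: regression coefficients (the finite-dimensional parameter).
\begin{align*}
  H(T) &= -X^{\top}\beta + \varepsilon,
    \qquad \varepsilon \sim F_{\varepsilon} \ \text{(known)}, \\
  \Pr(T > t \mid X) &= 1 - F_{\varepsilon}\!\left(H(t) + X^{\top}\beta\right).
\end{align*}
% Special cases:
%   F_eps(s) = 1 - exp(-e^{s})   (extreme value)     => proportional hazards,
%       with baseline cumulative hazard exp(H(t)).
%   F_eps(s) = e^{s}/(1 + e^{s}) (standard logistic) => proportional odds.
```

Under the extreme value choice, the survival function becomes exp{-exp(H(t)) exp(X'beta)}, which is the proportional hazards form; under the logistic choice, the odds of failure by time t are exp(H(t)) exp(X'beta), which is the proportional odds form.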

Public Health Relevance

We propose two innovative methods for studying the survival of breast cancer patients using the SEER data. The methods will serve as tools for analyzing and understanding the effect of prognostic and demographic factors on patient survival, which, in turn, will help us understand the etiology of survival of breast cancer patients.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Small Research Grants (R03)
Study Section
Special Emphasis Panel (ZCA1-SRLB-4 (J2))
Program Officer
Mariotto, Angela B
Texas A&M University
Biostatistics & Other Math Sci
Schools of Arts and Sciences
College Station
United States
Lee, DongHyuk; Carroll, Raymond J; Sinha, Samiran (2017) Frequentist Standard Errors of Bayes Estimators. Comput Stat 32:867-888
Maiti, Tapabrata; Sinha, Samiran; Zhong, Ping-Shou (2016) Functional Mixed Effects Model for Small Area Estimation. Scand Stat Theory Appl 43:886-903
Sinha, Samiran; Ma, Yanyuan (2014) Semiparametric analysis of linear transformation models with covariate measurement errors. Biometrics 70:21-32
Miao, Jingang; Sinha, Samiran; Wang, Suojin et al. (2014) Analysis of Multivariate Disease Classification Data in the Presence of Partially Missing Disease Traits. J Biom Biostat 5:
Sinha, Samiran; Saha, Krishna K; Wang, Suojin (2014) Semiparametric approach for non-monotone missing covariates in a parametric regression model. Biometrics 70:299-311