Categorical outcomes are ubiquitous in biomedical research, and generalized linear models (GLMs) represent the most widely applied methodology for testing associations between categorical variables and fixed investigative factors. Logistic regression in particular is the most frequently used model for binary data and has widespread applicability in the health, behavioral, and physical sciences. King and Ryan (2002) stated that 2,770 research papers published in 1999 had "logistic regression" in the title or among the keywords. King and Zeng (2001) referred to the use of the maximum likelihood method in logistic regression as "the nearly universal method". Maximum likelihood estimates (MLEs) for logistic regression are based on large-sample approximations that are reliable when the sample is large and the proportion of responses is neither too small nor too large. However, it has been known for several years that MLEs are not reliable for small, sparse, or unbalanced datasets, where "unbalanced" refers to a considerable difference between the number of zeros and ones in the response variable. Recent research has suggested a flexible means of correcting MLE bias and improving performance using a penalized likelihood-based approach, but the underlying theory has not been fully applied and implemented for practical use. In this project, we will extend the work begun during Phase 1 with logistic regression by (1) implementing the bias correction approach for a variety of other models, including Poisson, multinomial, and negative binomial GLMs and models for censored survival data; (2) providing new diagnostic procedures that identify potential problems with near separability and MLE bias; (3) implementing and evaluating an exact target estimation approach for bias correction in logistic regression; (4) improving the computational algorithms required for Aims 1-3; and (5) implementing the procedures in a SAS PROC.
Given the ubiquity of categorical regression in public health and biomedical research, the final product of this effort will provide a critical intermediate alternative when analyzing data for which standard large-sample methods are unreliable and small-sample exact methods are infeasible.
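
As an illustration of the penalized likelihood idea described above, the following is a minimal sketch and not the project's implementation. The abstract does not name the specific penalty; the sketch assumes a Firth-type (Jeffreys prior) penalty, a commonly used choice for logistic regression. The function name `firth_logistic`, its parameters, and the toy data are hypothetical. The data contain an empty cell (no events when x = 0), so the ordinary MLE does not exist, while the penalized estimate stays finite.

```python
import numpy as np


def firth_logistic(X, y, max_iter=200, tol=1e-8, max_step=5.0):
    """Logistic regression with a Firth-type penalty (illustrative sketch).

    Assumption: the penalty is 0.5 * log det(X'WX), which leads to the
    modified score  X' (y - p + h * (0.5 - p)),  with h the leverages.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))      # fitted probabilities
        w = p * (1.0 - p)                          # binomial variance weights
        info = X.T @ (X * w[:, None])              # Fisher information X'WX
        info_inv = np.linalg.inv(info)
        # Leverages h_i = w_i * x_i' (X'WX)^{-1} x_i
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        score = X.T @ (y - p + h * (0.5 - p))      # penalized (modified) score
        step = info_inv @ score
        largest = np.max(np.abs(step))
        if largest > max_step:                     # crude guard against overshooting
            step *= max_step / largest
        beta = beta + step
        if largest < tol:
            break
    return beta


# Toy data: x = 0 has 0 events out of 4, x = 1 has 3 events out of 4.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([0, 0, 0, 0, 1, 1, 1, 0])
X = np.column_stack([np.ones_like(x), x])          # intercept + slope

beta_hat = firth_logistic(X, y)
print("penalized estimates (intercept, slope):", beta_hat)
```

For this saturated two-group case, the penalized estimates coincide with the logits obtained after adding 0.5 to each cell of the 2x2 table, which gives a simple check on the sketch.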

Public Health Relevance

Generalized linear models (such as logistic regression) for categorical data have widespread applicability in the health sciences. Maximum likelihood, the nearly universal method for computing estimates in generalized linear regression models, is known to have high bias and mean square error for small, sparse, or unbalanced datasets. We propose to develop commercial software that incorporates several new methods with lower bias and mean square error for logistic regression, other generalized linear models, and Cox proportional hazards models.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Small Business Innovation Research Grants (SBIR) - Phase II (R44)
Study Section
Special Emphasis Panel (ZRG1-HDM-R (11))
Program Officer
Swain, Amy L
Cytel, Inc
United States