Case study background and problem formulations
——————————————————————–
Estimating the probability of Cesarean Section and the probability of Cephalopelvic Disproportion/Failure to Progress (CPD): Chen, G., Uryasev, S., and T. Young. On Prediction of the Cesarean Delivery Risk in a Large Private Practice, American Journal of Obstetrics and Gynecology, 191/2, 2004, 624-632.
——————————————————————–
Problem 1: maximizing log-likelihood
maximize logexp_sum (maximizing log-likelihood)
Value: logistic
——————————————————————–
logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)
logistic = Logistic, which calculates the value of the logistic function for every observation (scenario)
——————————————————————–
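As a reading aid, the two functions can be written out as follows. This is a sketch under an assumption: we take logexp_sum to be the log-likelihood averaged over scenarios, since objective values near -0.5 for 12,690 scenarios are more plausible for an average than for a sum. The symbols $x_j$ (factor vector of scenario $j$), $y_j \in \{0,1\}$ (observed outcome), $\beta$ (regression coefficients), and $J$ (number of scenarios) are notation introduced here, not taken from the PSG documentation.

```latex
\[
\mathrm{logistic}_j(\beta) = \frac{1}{1 + e^{-\beta^{\top} x_j}},
\qquad
\mathrm{logexp\_sum}(\beta) = \frac{1}{J}\sum_{j=1}^{J}
\Bigl[\, y_j\,\beta^{\top} x_j - \ln\bigl(1 + e^{\beta^{\top} x_j}\bigr) \Bigr].
\]
```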
Data and solution in Run-File Environment
Problem Datasets | | | | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.14GHz (sec)
---|---|---|---|---|---|---|---
Dataset1 | Problem Statement | Data | Solution | 6 | 12,690 | -0.495793 | 0.08

Problem Datasets | | | | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
---|---|---|---|---|---|---|---
Dataset1 | Matlab code | Data | Solution | 6 | 12,690 | -0.495793 | 0.05

Problem Datasets | | | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
---|---|---|---|---|---|---
Dataset1 | R code | Data | 6 | 12,690 | -0.495793 | 0.05
Problem 2: maximizing regularized log-likelihood
maximize logexp_sum - polynom_abs (maximizing regularized log-likelihood)
Value: logistic
——————————————————————–
logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)
polynom_abs = Polynomial Absolute
logistic = Logistic, which calculates the value of the logistic function for every observation (scenario)
——————————————————————–
Data and solution in Run-File Environment
Problem Datasets | | | | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.14GHz (sec)
---|---|---|---|---|---|---|---
Dataset1 | Problem Statement | Data | Solution | 6 | 12,690 | -0.498204 | 0.05

Problem Datasets | | | | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
---|---|---|---|---|---|---|---
Dataset1 | Matlab code | Data | Solution | 6 | 12,690 | -0.496348 | 0.04

Problem Datasets | | | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
---|---|---|---|---|---|---
Dataset1 | R code | Data | 6 | 12,690 | -0.496348 | 0.04
Problem 3: maximizing log-likelihood under cardinality constraint
maximize logexp_sum (maximizing log-likelihood)
Constraint: cardn <= 4
Solver: precision = 9
Value: logistic
——————————————————————–
logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)
cardn = Cardinality
logistic = Logistic, which calculates the value of the logistic function for every observation (scenario)
——————————————————————–
Data and solution in Run-File Environment
Problem Datasets | | | | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.14GHz (sec)
---|---|---|---|---|---|---|---
Dataset1 | Problem Statement | Data | Solution | 6 | 12,690 | -0.497135 | 0.35

Problem Datasets | | | | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
---|---|---|---|---|---|---|---
Dataset1 | Matlab code | Data | Solution | 6 | 12,690 | -0.497134 | <0.1

Problem Datasets | | | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
---|---|---|---|---|---|---
Dataset1 | R code | Data | 6 | 12,690 | -0.497134 | <0.1
Problem 4: 4-fold cross-validation (4 in-sample and 4 out-of-sample datasets) for maximization of the log-likelihood function
4-fold crossvalidation
Maximize logexp_sum
Value:
logistic (function Logistic on the in-sample data)
logistic (function Logistic on the out-of-sample data)
——————————————————————–
crossvalidation(N,Matrix) = matrix operation that splits the input Matrix into N pairs of complementary sub-matrices
logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)
logistic = Logistic, which calculates the value of the logistic function for every observation (scenario)
——————————————————————–
Data and solution in Run-File Environment
Problem Datasets | | | | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.14GHz (sec)
---|---|---|---|---|---|---|---
Dataset1 | Cycle statement | Data | Solution | 6 | 9,517 | -0.496 | 0.15
Dataset2 | | | | 6 | 9,517 | -0.495 | 0.18
Dataset3 | | | | 6 | 9,517 | -0.498 | 0.05
Dataset4 | | | | 6 | 9,517 | -0.494 | 0.08

Problem Datasets | | | | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
---|---|---|---|---|---|---|---
Dataset1 | Matlab code | Data | Solution | 6 | 9,517 | -0.496 | 0.11
Dataset2 | | | | 6 | 9,517 | -0.495 | 0.14
Dataset3 | | | | 6 | 9,517 | -0.498 | 0.05
Dataset4 | | | | 6 | 9,517 | -0.494 | 0.07

Problem Datasets | | | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
---|---|---|---|---|---|---
Dataset1 | R code | Data | 6 | 9,517 | -0.496 | 0.11
Dataset2 | | | 6 | 9,517 | -0.495 | 0.14
Dataset3 | | | 6 | 9,517 | -0.498 | 0.05
Dataset4 | | | 6 | 9,517 | -0.494 | 0.07
CASE STUDY SUMMARY
This case study finds an optimal estimate of the Cesarean section rate in a population of women. The risk of difficult labor is described by a probabilistic model that depends on measurable demographic factors, and we evaluated the effects of these factors on the probability of Cesarean section. The case study considers six primary factors: age, height, weight, maternal weight gain, gestational age, and birth weight. Background for this case study is described in Chen et al. (2004).
We considered four formulations of the logistic regression optimization problem:
• Problem 1. Maximization of the log-likelihood function (“plain vanilla” logistic regression).
• Problem 2. Maximization of the log-likelihood function minus an additional regularization term (regularized logistic regression).
• Problem 3. Maximization of the log-likelihood function subject to a constraint on cardinality.
• Problem 4. Cross-validation applied to maximization of the log-likelihood function.
Problem 1 was implemented in PSG by maximizing the log-likelihood function, which is a standard PSG function (“logexp_sum”). This problem formulation was considered in Chen et al. (2004).
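For readers without PSG, the following is a minimal Python sketch of the same kind of problem. It is not the PSG implementation: it assumes the objective is the log-likelihood averaged over scenarios, uses plain gradient ascent as the solver, and runs on synthetic stand-in data rather than the case study dataset. All function names, the step size, and the data are hypothetical.

```python
import math
import random


def logistic(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(beta, row):
    return sum(b * x for b, x in zip(beta, row))


def mean_log_likelihood(beta, X, y):
    """(1/J) * sum_j [ y_j * z_j - log(1 + exp(z_j)) ], with z_j = beta . x_j."""
    total = 0.0
    for row, label in zip(X, y):
        z = dot(beta, row)
        # Stable evaluation of log(1 + exp(z)).
        total += label * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total / len(X)


def fit(X, y, steps=300, lr=1.0):
    """Gradient ascent on the mean log-likelihood, starting from beta = 0."""
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    for _ in range(steps):
        grad = [0.0] * m
        for row, label in zip(X, y):
            resid = label - logistic(dot(beta, row))
            for k in range(m):
                grad[k] += resid * row[k] / n
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta


# Synthetic stand-in data (assumption): intercept column plus two standardized factors.
random.seed(0)
true_beta = [-0.5, 1.0, -2.0]
X = [[1.0, random.gauss(0, 1), random.gauss(0, 1)] for _ in range(300)]
y = [1 if random.random() < logistic(dot(true_beta, row)) else 0 for row in X]

beta_hat = fit(X, y)
print("estimated coefficients:", [round(b, 3) for b in beta_hat])
print("mean log-likelihood:", round(mean_log_likelihood(beta_hat, X, y), 6))
```

With all coefficients at zero the mean log-likelihood equals -ln 2 (about -0.693), which gives a simple baseline against which a fitted value can be compared.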
In Problem 2 a regularization term was subtracted from the log-likelihood function to improve the out-of-sample performance of the regression model. Regularization is widely used in data-mining applications; see, for instance, Shi et al. (2008). For regularization we used the “polynom_abs” function, which is a standard PSG function. The coefficients of this polynomial absolute function were obtained with a steepest descent algorithm that optimizes out-of-sample performance.
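The regularized objective can be illustrated with a short Python sketch. It assumes that polynom_abs is a weighted sum of absolute values of the regression coefficients and that the log-likelihood is averaged over scenarios. The penalty weights below are hypothetical and are not the coefficients tuned in the case study; proximal gradient ascent with soft-thresholding is a technique swapped in here for illustration and is not the PSG solver.

```python
import math
import random


def logistic(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(beta, row):
    return sum(b * x for b, x in zip(beta, row))


def mean_log_likelihood(beta, X, y):
    total = 0.0
    for row, label in zip(X, y):
        z = dot(beta, row)
        total += label * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total / len(X)


def polynom_abs(beta, weights):
    """Weighted sum of absolute values: sum_i w_i * |beta_i| (assumed form)."""
    return sum(w * abs(b) for w, b in zip(weights, beta))


def objective(beta, X, y, weights):
    return mean_log_likelihood(beta, X, y) - polynom_abs(beta, weights)


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def fit_regularized(X, y, weights, steps=300, lr=1.0):
    """Proximal gradient ascent on mean log-likelihood minus polynom_abs."""
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    for _ in range(steps):
        grad = [0.0] * m
        for row, label in zip(X, y):
            resid = label - logistic(dot(beta, row))
            for k in range(m):
                grad[k] += resid * row[k] / n
        # Gradient step on the smooth part, then shrinkage for the penalty.
        beta = [soft_threshold(b + lr * g, lr * w)
                for b, g, w in zip(beta, grad, weights)]
    return beta


# Synthetic stand-in data (assumption): intercept column plus two standardized factors.
random.seed(0)
true_beta = [-0.5, 1.0, -2.0]
X = [[1.0, random.gauss(0, 1), random.gauss(0, 1)] for _ in range(300)]
y = [1 if random.random() < logistic(dot(true_beta, row)) else 0 for row in X]

small = [0.0, 0.02, 0.02]   # hypothetical weights; the intercept is not penalized
large = [0.0, 10.0, 10.0]   # a penalty heavy enough to zero out both factors
beta_small = fit_regularized(X, y, small)
beta_large = fit_regularized(X, y, large)
print("light penalty:", [round(b, 3) for b in beta_small])
print("heavy penalty:", [round(b, 3) for b in beta_large])
```

Tuning the weights against out-of-sample performance, as described above, would wrap a search over `weights` around `fit_regularized`; that outer loop is omitted here.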
The constraint on cardinality in Problem 3 was used to reduce the number of factors and improve the out-of-sample performance of the regression model.
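One way to see what a cardinality constraint does is to enumerate factor subsets explicitly. The sketch below is a brute-force illustration, not the method PSG uses for cardn: it fits a logistic regression on every subset of at most `max_card` factors and keeps the subset with the highest mean log-likelihood. The assumptions are that an intercept is always included and not counted toward the cardinality, that the data are synthetic, and that all names are hypothetical.

```python
import itertools
import math
import random


def logistic(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(beta, row):
    return sum(b * x for b, x in zip(beta, row))


def mean_log_likelihood(beta, X, y):
    total = 0.0
    for row, label in zip(X, y):
        z = dot(beta, row)
        total += label * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total / len(X)


def fit(X, y, steps=150, lr=1.0):
    """Gradient ascent on the mean log-likelihood, starting from beta = 0."""
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    for _ in range(steps):
        grad = [0.0] * m
        for row, label in zip(X, y):
            resid = label - logistic(dot(beta, row))
            for k in range(m):
                grad[k] += resid * row[k] / n
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta


def best_subset(X, y, n_factors, max_card):
    """Column 0 is the intercept (always kept); factors are columns 1..n_factors."""
    best = None
    for k in range(max_card + 1):
        for subset in itertools.combinations(range(1, n_factors + 1), k):
            cols = (0,) + subset
            Xs = [[row[c] for c in cols] for row in X]
            ll = mean_log_likelihood(fit(Xs, y), Xs, y)
            if best is None or ll > best[1]:
                best = (subset, ll)
    return best


# Synthetic stand-in data (assumption): four factors, only factors 1 and 3 matter.
random.seed(1)
true_beta = [-0.5, 2.0, 0.0, -2.0, 0.0]
X = [[1.0] + [random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [1 if random.random() < logistic(dot(true_beta, row)) else 0 for row in X]

subset, ll = best_subset(X, y, n_factors=4, max_card=2)
print("selected factors:", subset, "mean log-likelihood:", round(ll, 4))
```

Enumeration is practical only for a handful of factors. With six factors and at most four selected there are 57 candidate subsets, which is still manageable, but the count grows quickly with the number of factors.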
Problem 4 applies 4-fold cross-validation to the maximization of the log-likelihood from Problem 1. In each pass we selected ¾ of the data as the in-sample dataset, on which we calibrated the model. We then tested the model on the remaining ¼ of the data (the out-of-sample dataset) to observe how well it predicts the probability of Cesarean section.
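The cross-validation loop can be sketched in Python as follows. This is an illustration under assumptions, not the PSG crossvalidation operation: the folds are taken as contiguous blocks of rows (PSG may partition differently, so fold sizes need not match the tables above), the log-likelihood is averaged over scenarios, the solver is plain gradient ascent, and the data are synthetic. All names are hypothetical.

```python
import math
import random


def logistic(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(beta, row):
    return sum(b * x for b, x in zip(beta, row))


def mean_log_likelihood(beta, X, y):
    total = 0.0
    for row, label in zip(X, y):
        z = dot(beta, row)
        total += label * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total / len(X)


def fit(X, y, steps=200, lr=1.0):
    """Gradient ascent on the mean log-likelihood, starting from beta = 0."""
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    for _ in range(steps):
        grad = [0.0] * m
        for row, label in zip(X, y):
            resid = label - logistic(dot(beta, row))
            for k in range(m):
                grad[k] += resid * row[k] / n
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta


def crossvalidation_split(n_rows, n_folds):
    """Return n_folds pairs (in_sample, out_of_sample) of complementary row indices."""
    bounds = [i * n_rows // n_folds for i in range(n_folds + 1)]
    pairs = []
    for i in range(n_folds):
        out_idx = list(range(bounds[i], bounds[i + 1]))
        in_idx = list(range(bounds[i])) + list(range(bounds[i + 1], n_rows))
        pairs.append((in_idx, out_idx))
    return pairs


# Synthetic stand-in data (assumption): intercept column plus two standardized factors.
random.seed(2)
true_beta = [-0.5, 1.0, -2.0]
X = [[1.0, random.gauss(0, 1), random.gauss(0, 1)] for _ in range(400)]
y = [1 if random.random() < logistic(dot(true_beta, row)) else 0 for row in X]

results = []
for in_idx, out_idx in crossvalidation_split(len(X), 4):
    X_in, y_in = [X[i] for i in in_idx], [y[i] for i in in_idx]
    X_out, y_out = [X[i] for i in out_idx], [y[i] for i in out_idx]
    beta = fit(X_in, y_in)  # calibrate on 3/4 of the data
    results.append((mean_log_likelihood(beta, X_in, y_in),
                    mean_log_likelihood(beta, X_out, y_out)))

for fold, (ll_in, ll_out) in enumerate(results, start=1):
    print(f"fold {fold}: in-sample {ll_in:.4f}, out-of-sample {ll_out:.4f}")
```

Comparing the in-sample and out-of-sample values for each fold gives a rough indication of overfitting: a large gap suggests the model is calibrated too closely to the in-sample data.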
REFERENCES
• Chen, G., Uryasev, S., and Young, T.K. (2004): On the prediction of the cesarean delivery risk in a large private practice. American Journal of Obstetrics and Gynecology, 191, 617-25.
• Shi, W., Wahba, G., Wright, S., Lee, K., Klein, R., and Klein, B. (2008): LASSO-Patternsearch algorithm with application to ophthalmology and genomic data. Statistics and Its Interface, 1(1), 137-153.