模型检验
整体模型似然比检验
VGAM::lrtest(model1)
## Likelihood ratio test
## Model 1: ordered(apply) ~ pared + public + gpa
## Model 2: ordered(apply) ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 795 -358.51
## 2 798 -370.60 3 24.18 2.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
模型卡方值为24.18,P<0.05,说明拒绝H0:所有回归系数为0假设,整体模型成立。所得模型public无统计学意义,其他变量有统计学意义。
所得累计logistic回归模型的logit线性形式如下:
logit(p(y <=1|X)) = 2.20335-1.04766pared1-0.61575gpa
logit(p(y <=2|X)) = 4.29879-1.04766pared1-0.61575gpa
对应的概率模型形式:
p(y <=1|X)= exp(2.20335-1.04766pared1-0.61575gpa)/(1+exp(2.20335-1.04766pared1-0.61575gpa))=1/(1+exp(-2.20335+1.04766pared1+0.61575gpa))
p(y <=2|X)= 1/(1+exp(-4.29879+1.04766pared1+0.61575gpa)
对待录取态度3个等级:unlikely、somewhat likely、very likely的可能性P1 = p(y <=1|X),P2 = p(y <=2|X)- p(y <=1|X),P3 = 1-P1-P2=1-p(y <=2|X)
模型预测
通过predict()输出模型预测值,type参数指定输出数据,link为预测值输出,response为预测概率值输出,terms为输出所有变量等。
#预测概率值
pp <- predict(model1,type = "response")
dat2 <- cbind(dat,pp)
#预测分类
dat2$class <- max.col(pp)
head(dat2)
## apply pared public gpa unlikely somewhat likely very likely
## 1 very likely 0 0 3.26 0.5488412 0.3593239 0.09183492
## 2 somewhat likely 1 0 3.21 0.3055774 0.4759460 0.21847662
## 3 unlikely 1 1 3.94 0.2293992 0.4781968 0.29240392
## 4 somewhat likely 0 0 2.81 0.6161122 0.3126959 0.07119192
## 5 somewhat likely 0 0 2.53 0.6559934 0.2834059 0.06060072
## 6 unlikely 0 1 2.59 0.6608795 0.2797448 0.05937562
## class
## 1 1
## 2 2
## 3 2
## 4 1
## 5 1
## 6 1
模型评价
预测分类正确率
with(dat2,table(apply,class))
## class
## apply 1 2
## unlikely 201 19
## somewhat likely 110 30
## very likely 27 13
模型正确判断“unlikely” 的比例为201/220=0.9136,但正确判断“somewhat likely”比例为20/140=0.1429,判断“very likely” 的比例为0 ,说明模型 拟合效果不好。
广义决定系数R^2
由上面程序计算广义R^2,都接近0,远离1,说明模型拟合效果不好。
model1.null<-vglm(apply~ 1, data=dat, family=cumulative(parallel=T))
LLf <- logLik(model1)
LL0 <- logLik(model1.null)
N<-nrow(dat)
#McFadden pseudo-R2, Cox & Snell -R2, Nagelkerke -R2
cbind( 'McFadden R2'= 1 - (LLf / LL0) ,
'Cox & Snell R2'=1 - exp((2/N) * (LL0 - LLf)) ,
'Nagelkerke R2'=(1 - exp((2/N) * (LL0 - LLf))) / (1 - exp(LL0)^(2/N)) )
## McFadden R2 Cox & Snell R2 Nagelkerke R2
## [1,] 0.0326231 0.05866013 0.06956551
MASS包构建有序Logistic回归模型
MASS包无法直接出现P值结果,需要编辑公式计算,Hess=TRUE表示在最优化下,输出观察信息矩阵。
m <- polr(apply ~ pared + public +gpa,data = dat,Hess = TRUE)
#计算p值
ctable <- coef(summary(m))
p <- pnorm(abs(ctable[,"t value"]),lower.tail = FALSE)*2
ctable <- cbind(ctable,"p value" = p)
#CI
(ci <- confint(m))
# default method gives profiled CIs
## 2.5 % 97.5 %
## pared 0.5281768 1.5721750
## public -0.6522060 0.5191384
## gpa 0.1076202 1.1309148
confint.default(m) # CIs assuming normality
## 2.5 % 97.5 %
## pared 0.5267524 1.5686278
## public -0.6425833 0.5250119
## gpa 0.1051074 1.1267737
## OR and CI
exp(cbind(OR = coef(m), ci))
## OR 2.5 % 97.5 %
## pared 2.8510579 1.6958376 4.817114
## public 0.9429088 0.5208954 1.680579
## gpa 1.8513972 1.1136247 3.098490
summary(m)
## Call:
## polr(formula = apply ~ pared + public + gpa, data = dat, Hess = TRUE)
## Coefficients:
## Value Std. Error t value
## pared 1.04769 0.2658 3.9418
## public -0.05879 0.2979 -0.1974
## gpa 0.61594 0.2606 2.3632
## Intercepts:
## Value Std. Error t value
## unlikely|somewhat likely 2.2039 0.7795 2.8272
## somewhat likely|very likely 4.2994 0.8043 5.3453
## Residual Deviance: 717.0249
## AIC: 727.0249
注意的是:MASS包polr函数中,结果的常数项(截距)的计算结果值与vglm()函数的结果完全一致,但是自变量(例如pared1)的回归系数的值1.0476则刚好与vglm()函数的结果值-1.0476符号相反,这是因为两个函数定义的模型形式有差异所致。MASS包的polr()函数的累积logistic模型定义是logit(p(y <=j|X)) = 截距-eta ,采用的是减法 形式,而一般的累积logistic回归,如vglm(),的函数定义形式是logit(p(y <=j|X)) = 截距+eta ,是加法形式,所以两种定义计算出来的符号刚好相反。
变量选择
在R中运用MASS包的step()函数进行逐步回归分析,采用AIC选择自变量。R的逐步法函数step()通过使用选项direction=’backward’、direction=’forward’、direction=’both’分别指定了向后逐步法、向前逐步法和双向逐步法。
运用dropl()函数进行自变量的似然比检验,public无统计学意义,可考虑删除。
selecting <- step(m,direction = "backward")
## Start: AIC=727.02
## apply ~ pared + public + gpa
##
## Df AIC
## - public 1 725.06
## <none> 727.02
## - gpa 1 730.67
## - pared 1 740.60
##
## Step: AIC=725.06
## apply ~ pared + gpa
##
## Df AIC
## <none> 725.06
## - gpa 1 728.79
## - pared 1 738.60
selecting$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 395 717.0249 727.0249
## 2 - public 1 0.03891634 396 717.0638 725.0638
droping <- drop1(m,test = "Chisq")
默认采用向后法。如果是采用向后法必须首先建立包含所有自变量的模型即全模型;如果采用向前法、逐步法除了必须要建立包含所有变量的全模型作为逐步法的终点外,还需建立最小变量模型作为逐步法起点,一般最小模型是不包含自变量的空模型。逐步法的结果输出中引入自变量用符号‘+’,删除自变量用符号‘-’,同时列出对应的AIC变化。
上面结果中,Start: AIC=727.02开始时模型的AIC,此时模型形式是apply ~ pared + public + gpa,模型为全模型,由于引入public后模型的AIC减少,所以首先引入public;下一步AIC=725.06,删除模型内的gpa和pared都使AIC的值增大,所以,逐步过程到此结束。
需要加以说明的是在有序变量赋值时,为与流行病学上对优势比解释保持一致(OR>1为危险因素,OR<1为保护因素),在对应变量赋值时应将专业上最不利的等级赋予最小值,最有利的赋予最大值,如病情从“严重”到“轻”,赋值为1,2,3等。
绘图展示
创建一个新数据集,查看gpa在不同取值时的概率曲线。
newdat <- data.frame(
pared = rep(0:1, 200), public = rep(0:1, each = 200), gpa = rep(seq(from = 1.9, to = 4, length.out = 100), 4))
newdat <- cbind(newdat, predict(m, newdat, type = "probs"))
head(newdat)
## pared public gpa unlikely somewhat likely very likely
## 1 0 0 1.900000 0.7376186 0.2204577 0.04192370
## 2 1 0 1.921212 0.4932185 0.3945673 0.11221424
## 3 0 0 1.942424 0.7325300 0.2244841 0.04298593
## 4 1 0 1.963636 0.4866885 0.3984676 0.11484395
## 5 0 0 1.984848 0.7273792 0.2285470 0.04407383
## 6 1 0 2.006061 0.4801630 0.4023098 0.11752712
#plot
lnewdat <- melt(newdat, id.vars = c("pared", "public", "gpa"), variable.name = "Level", value.name="Probability")
head(lnewdat)
## pared public gpa Level Probability
## 1 0 0 1.900000 unlikely 0.7376186
## 2 1 0 1.921212 unlikely 0.4932185
## 3 0 0 1.942424 unlikely 0.7325300
## 4 1 0 1.963636 unlikely 0.4866885
## 5 0 0 1.984848 unlikely 0.7273792
## 6 1 0 2.006061 unlikely 0.4801630
ggplot(lnewdat, aes(x = gpa, y = Probability, colour = Level)) +
geom_line() + facet_grid(pared ~ public, labeller="label_both")
参考资料