Logistic regression after imputation in R

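I am trying to fit a logistic regression to the Wisconsin breast cancer dataset using glm in R. I inspected the dataset and found that wbc$V7 contains missing values. I imputed the missing values with the Hmisc package and then ran the logistic regression with glm: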
wbc=read.csv(file="https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",header=FALSE)
wbc[wbc=='?']=NA  #replacing '?' with NA
a=sapply(wbc,function(x) sum(is.na(x))) #analyse the number of NA in each column
print(a)
library(Hmisc)
wbc$V7=impute(wbc$V7,mode)  #impute missing values with mode in V7
wbc$V11[wbc$V11==2]=0; #V11 has either '2' or '4' as entries, replacing '2' with '0' and '4' with '1' 
wbc$V11[wbc$V11==4]=1;
model <- glm(V11~V2+V3+V4+V5+V6+V7+V8+V9+V10,family=binomial(),data=wbc)

OUTPUT:


Call:  glm(formula = V11 ~ V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10, 
family = binomial(), data = wbc)

Coefficients:
(Intercept)           V2           V3           V4           V5           V6          V71         V710
     8.6625       0.4511      -0.1013       0.4842       0.2206       0.1684     -18.7466     -14.8168
        V72          V73          V74          V75          V76          V77          V78          V79
   -17.6684     -16.0272     -15.3552     -16.3765       0.7704     -16.2944     -16.6171           NA
         V8           V9          V10
     0.5052       0.1144       0.4550

Degrees of Freedom: 698 Total (i.e. Null);  681 Residual
Null Deviance:      900.5 
Residual Deviance: 102.9    AIC: 138.9

Why does the output contain coefficients for V71, V710, V72, V73, V74, V75, V76, V77, V78 and V79 when the wbc data frame only has the columns V1, V2, V3, V4, V5, V6, V7, V8, V9 and V10?

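If V7 is a factor, glm dummy-codes it automatically when fitting the model, so you get a separate coefficient for each category of the factor (apart from the reference level).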
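To see where names such as V710 and V72 come from, here is a minimal sketch on made-up data. The data frame `toy` and its columns `x_text`, `x_factor` and `x_numeric` are hypothetical and are not part of the wbc data; the sketch only assumes that the digits were read as text, as V7 was because of the '?' entries.

```r
# Toy data: the same values stored as text and as numbers.
toy <- data.frame(
  y         = c(0, 1, 0, 1, 1, 0, 1, 0),
  x_text    = c("1", "2", "10", "1", "2", "10", "1", "10"),
  x_numeric = c(1, 2, 10, 1, 2, 10, 1, 10),
  stringsAsFactors = FALSE
)
toy$x_factor <- factor(toy$x_text)

# Levels built from text are sorted as text, so "10" comes before "2".
factor_levels <- levels(toy$x_factor)
print(factor_levels)
# "1" "10" "2"

# A factor gets one indicator column per level except the first
# (the reference level); each column is named variable + level.
factor_cols <- colnames(model.matrix(y ~ x_factor, data = toy))
print(factor_cols)
# "(Intercept)" "x_factor10" "x_factor2"

# A numeric predictor contributes a single column, hence a single slope.
numeric_cols <- colnames(model.matrix(y ~ x_numeric, data = toy))
print(numeric_cols)
# "(Intercept)" "x_numeric"
```

The column names are the variable name pasted onto the level label, which is the same pattern as V7 followed by 1, 10, 2 and so on in the glm output.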

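You should convert the variable V7 to numeric. It is currently a factor, which is why you get a coefficient for each distinct value in the V7 column. Converting it to numeric will solve your problem.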
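One possible way to do the conversion is sketched below on a small stand-in vector, so that it runs without downloading the data. The names `v7_raw`, `v7_chr`, `v7_num`, `counts` and `v7_mode` are made up for illustration, and the most frequent value is computed with base R's table() rather than with Hmisc, which is a substitution on my part and not the asker's method.

```r
# Toy stand-in for wbc$V7: digits read as text, with "?" marking missing values.
v7_raw <- factor(c("1", "10", "?", "2", "1", "?", "1", "3"))

# Convert through character first; as.numeric() applied directly to a factor
# returns the internal level codes rather than the printed values.
v7_chr <- as.character(v7_raw)
v7_chr[v7_chr == "?"] <- NA
v7_num <- as.numeric(v7_chr)

# Most frequent observed value (the statistical mode). Base R's mode()
# reports an object's storage mode instead, so a frequency table is used.
counts  <- table(v7_num)
v7_mode <- as.numeric(names(counts)[which.max(counts)])

# Fill the missing entries with that value.
v7_num[is.na(v7_num)] <- v7_mode

print(is.numeric(v7_num))
print(v7_num)
```

Applied to the real data, the same steps on wbc$V7 before calling glm should leave a single V7 coefficient in the output. Alternatively, passing na.strings = "?" to read.csv makes it treat the marker as missing while reading, so the column is parsed as numeric from the start.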

Hope this helps.