Automated variable selection method

I have a disease dataset. desease_rate is the dependent variable, and the remaining columns are the independent variables.

data <- read.csv("H:/uni/MS_DS/disease.csv")
data

> data
         radius      texture perimeter   area smoothness desease_rate
1  -0.018743998  0.002521470 -0.005025 0.0710 0.00000000         0.07
2  -0.027940652  0.003164681 -0.004625 0.0706 0.06476967         0.02
3   0.002615946  0.001328688 -0.005525 0.0726 0.06268457         0.07
4   0.041963329  0.002769471 -0.004325 0.0699 0.06013138         0.06
5   0.030261380  0.005725780 -0.003525 0.0695 0.05942403         0.04
6  -0.030559594  0.001576348 -0.002525 0.0695 0.06110087         0.05
7   0.002698690 -0.003028856 -0.006025 0.0706 0.06207810         0.07
8  -0.044996901  0.000617110 -0.009525 0.0691 0.05940039         0.05
9   0.022993350 -0.000637109 -0.015425 0.0695 0.05870643         0.03
10  0.001398530 -0.000470057 -0.017125 0.0705 0.05540871         0.01
11  0.026827990  0.000509490 -0.014025 0.0681 0.05588225         0.06
12 -0.076220726  0.001018820 -0.010225 0.0631 0.05515852         0.01
13 -0.021917789  0.000822517 -0.003925 0.0576 0.05584590         0.03
14  0.012491060 -0.007363090  0.005175 0.0569 0.05120000         0.03
15  0.038281834 -0.008005798  0.014975 0.0576 0.04940000         0.06
16 -0.033198384  0.000350052  0.022875 0.0564 0.04930000         0.01
17 -0.002358179  0.003846831  0.022675 0.0572 0.05050000         0.07
18  0.020808766  0.000536629  0.024575 0.0656 0.04820000         0.04
19  0.091888897 -0.002393641  0.009775 0.0761 0.04740000         0.07
20 -0.036293550 -0.002889337  0.001775 0.0828 0.04770000         0.01

Selecting variables with an automated method

library(leaps)
library(MASS)

model <- regsubsets(desease_rate ~ radius + texture + perimeter + area + smoothness,
                    data = df1, nbest = 1, method = "forward", nvmax = 4)

summary(model)

Subset selection object
Call: regsubsets.formula(desease_rate ~ radius + texture + perimeter + 
    area + smoothness, data = df1, nbest = 1, method = "forward", 
    nvmax = 4)
5 Variables  (and intercept)
           Forced in Forced out
radius         FALSE      FALSE
texture        FALSE      FALSE
perimeter      FALSE      FALSE
area           FALSE      FALSE
smoothness     FALSE      FALSE
1 subsets of each size up to 4
Selection Algorithm: forward
         radius texture perimeter area smoothness
1  ( 1 ) "*"    " "     " "       " "  " "       
2  ( 1 ) "*"    " "     " "       " "  "*"       
3  ( 1 ) "*"    "*"     " "       " "  "*"       
4  ( 1 ) "*"    "*"     " "       "*"  "*" 

I am not sure what to do after this code. How can I complete the variable selection process automatically? Please help.

In response to the second comment (partial output)

> # lm is just an example
> fit <- lm(formula = desease_rate ~ radius + texture + perimeter + area + smoothness, data = df1)

> stepAIC(fit, direction="both")
Start:  AIC=-147.98
desease_rate ~ radius + texture + perimeter + area + smoothness

             Df  Sum of Sq       RSS     AIC
- perimeter   1 0.00000133 0.0067164 -149.98
- area        1 0.00000273 0.0067178 -149.97
- texture     1 0.00027405 0.0069891 -149.18
- smoothness  1 0.00042853 0.0071436 -148.75
<none>                     0.0067151 -147.98
- radius      1 0.00269252 0.0094076 -143.24

Second-to-last row: <none>

Here is a solution using stepAIC from the MASS package:

library(MASS)
# lm is just an example
fit <- lm(formula = desease_rate ~ radius + texture + perimeter + area + smoothness, data = data) 
stepAIC(fit, direction="both")
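
To work with the selected model rather than only reading the printed trace, the result of stepAIC can be stored. This is a minimal sketch, assuming the 20 rows printed in the question. The names disease, full, and selected are illustrative and not from the original post. Because the printed values are rounded, the numbers may differ slightly from the trace shown in the question.

```r
library(MASS)  # provides stepAIC()

# The 20 rows printed in the question, rebuilt inline so the sketch
# runs without the original CSV file.
disease <- data.frame(
  radius = c(-0.018743998, -0.027940652, 0.002615946, 0.041963329, 0.030261380,
             -0.030559594, 0.002698690, -0.044996901, 0.022993350, 0.001398530,
             0.026827990, -0.076220726, -0.021917789, 0.012491060, 0.038281834,
             -0.033198384, -0.002358179, 0.020808766, 0.091888897, -0.036293550),
  texture = c(0.002521470, 0.003164681, 0.001328688, 0.002769471, 0.005725780,
              0.001576348, -0.003028856, 0.000617110, -0.000637109, -0.000470057,
              0.000509490, 0.001018820, 0.000822517, -0.007363090, -0.008005798,
              0.000350052, 0.003846831, 0.000536629, -0.002393641, -0.002889337),
  perimeter = c(-0.005025, -0.004625, -0.005525, -0.004325, -0.003525,
                -0.002525, -0.006025, -0.009525, -0.015425, -0.017125,
                -0.014025, -0.010225, -0.003925, 0.005175, 0.014975,
                0.022875, 0.022675, 0.024575, 0.009775, 0.001775),
  area = c(0.0710, 0.0706, 0.0726, 0.0699, 0.0695, 0.0695, 0.0706, 0.0691,
           0.0695, 0.0705, 0.0681, 0.0631, 0.0576, 0.0569, 0.0576, 0.0564,
           0.0572, 0.0656, 0.0761, 0.0828),
  smoothness = c(0.00000000, 0.06476967, 0.06268457, 0.06013138, 0.05942403,
                 0.06110087, 0.06207810, 0.05940039, 0.05870643, 0.05540871,
                 0.05588225, 0.05515852, 0.05584590, 0.05120000, 0.04940000,
                 0.04930000, 0.05050000, 0.04820000, 0.04740000, 0.04770000),
  desease_rate = c(0.07, 0.02, 0.07, 0.06, 0.04, 0.05, 0.07, 0.05, 0.03, 0.01,
                   0.06, 0.01, 0.03, 0.03, 0.06, 0.01, 0.07, 0.04, 0.07, 0.01)
)

# Start from the full model, as in the answer above.
full <- lm(desease_rate ~ radius + texture + perimeter + area + smoothness,
           data = disease)

# trace = FALSE suppresses the step-by-step printout; the return value
# is the final fitted lm object chosen by the search.
selected <- stepAIC(full, direction = "both", trace = FALSE)

formula(selected)   # predictors that remain after the search
selected$anova      # record of the terms dropped or added at each step
summary(selected)   # usual lm summary of the selected model
```

The stored object behaves like any other lm fit, so predict(), residuals(), and similar functions can be applied to it directly.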

See ?stepAIC for more information.
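
The regsubsets search from the question can also be carried through to a final model. The sketch below is one possible way to do that, not the only one. It picks the subset size with the lowest BIC from summary(), which is an assumption on my part; adjusted R² or Mallows' Cp would be equally defensible criteria. The names disease, search, crit, best_size, chosen, and final are illustrative, and the data frame is rebuilt from the rows printed in the question so that the block runs on its own.

```r
library(leaps)  # provides regsubsets()

# The 20 rows printed in the question (values as displayed, so rounded).
disease <- data.frame(
  radius = c(-0.018743998, -0.027940652, 0.002615946, 0.041963329, 0.030261380,
             -0.030559594, 0.002698690, -0.044996901, 0.022993350, 0.001398530,
             0.026827990, -0.076220726, -0.021917789, 0.012491060, 0.038281834,
             -0.033198384, -0.002358179, 0.020808766, 0.091888897, -0.036293550),
  texture = c(0.002521470, 0.003164681, 0.001328688, 0.002769471, 0.005725780,
              0.001576348, -0.003028856, 0.000617110, -0.000637109, -0.000470057,
              0.000509490, 0.001018820, 0.000822517, -0.007363090, -0.008005798,
              0.000350052, 0.003846831, 0.000536629, -0.002393641, -0.002889337),
  perimeter = c(-0.005025, -0.004625, -0.005525, -0.004325, -0.003525,
                -0.002525, -0.006025, -0.009525, -0.015425, -0.017125,
                -0.014025, -0.010225, -0.003925, 0.005175, 0.014975,
                0.022875, 0.022675, 0.024575, 0.009775, 0.001775),
  area = c(0.0710, 0.0706, 0.0726, 0.0699, 0.0695, 0.0695, 0.0706, 0.0691,
           0.0695, 0.0705, 0.0681, 0.0631, 0.0576, 0.0569, 0.0576, 0.0564,
           0.0572, 0.0656, 0.0761, 0.0828),
  smoothness = c(0.00000000, 0.06476967, 0.06268457, 0.06013138, 0.05942403,
                 0.06110087, 0.06207810, 0.05940039, 0.05870643, 0.05540871,
                 0.05588225, 0.05515852, 0.05584590, 0.05120000, 0.04940000,
                 0.04930000, 0.05050000, 0.04820000, 0.04740000, 0.04770000),
  desease_rate = c(0.07, 0.02, 0.07, 0.06, 0.04, 0.05, 0.07, 0.05, 0.03, 0.01,
                   0.06, 0.01, 0.03, 0.03, 0.06, 0.01, 0.07, 0.04, 0.07, 0.01)
)

# Same forward search as in the question: one model per size, up to 4 predictors.
search <- regsubsets(desease_rate ~ radius + texture + perimeter + area + smoothness,
                     data = disease, nbest = 1, nvmax = 4, method = "forward")
crit <- summary(search)

# Choose the model size with the lowest BIC (assumed criterion;
# which.max(crit$adjr2) or which.min(crit$cp) are alternatives).
best_size <- which.min(crit$bic)

# Predictor names marked TRUE for that size, dropping the intercept column.
chosen <- names(which(crit$which[best_size, -1]))

coef(search, id = best_size)  # coefficients of the chosen subset

# Refit with lm() to get standard errors and the usual summary output.
final <- lm(reformulate(chosen, response = "desease_rate"), data = disease)
summary(final)
```

Refitting with lm() is a convenience: regsubsets only reports which variables enter at each size and the fit criteria, while the lm object gives the full summary and works with predict().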