Automatically removing variables from dataframe based on VIF criteria using R

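I have a series of data frames, each representing a linear model. I want to automatically remove columns from each data frame based on a VIF threshold of 10. A given data frame looks like this: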

df_nn <- structure(list(capital = c(100, 101, 102, 103, 
104, 105, 106, 107, 108, 109, 
110, 111, 112, 113, 114, 115, 
116, 117, 118, 119, 120, 121, 
122, 123, 124, 125, 126, 127, 
128, 129, 130, 131, 132), IVAE = c(109.19, 
110.09, 111.84, 112.49, 111.99, 113.11, 111.89, 112.11, 112.75, 
113.7, 112.93, 112.43, 114.88, 114.5, 114.93, 115.13, 105.54, 
91.71, 87.93, 93.06, 96.74, 103.26, 106.76, 109.6, 110.74, 112, 
112.73, 114.97, 115.01, 114.67, 115.78, 114.52, 111.91), `Índice de Producción Industrial (IPI): Industrias Manufactureras, Explotación de Minas y Canteras y Otras Actividades Industriales` = c(101.4, 
103.4, 106.72, 108.45, 107.76, 107.25, 105.75, 107.03, 107.31, 
106.61, 106.95, 106.61, 110.18, 108.68, 109.66, 111.32, 100.02, 
76.77, 73.46, 81.99, 94.83, 100.64, 104.51, 106.74, 107.04, 108.75, 
110.8, 110.59, 111.25, 108.82, 110.03, 111.32, 107.61), Construcción = c(112.25, 
117.5, 124.32, 122.64, 121.21, 128.69, 122.28, 126.55, 120.13, 
137.47, 129.82, 126.83, 132.92, 131.72, 137.56, 130.89, 117.08, 
87.62, 67.49, 79.56, 88.97, 117.57, 110.01, 118.02, 117.61, 121.64, 
120.76, 120.99, 118.96, 122.7, 122.59, 101.2, 106.3), `Comercio, Transporte y Almacenamiento, Actividades de Alojamiento y de Servicio de Comidas` = c(112.2, 
113.03, 115.69, 113.74, 114.7, 115.93, 115.3, 114.25, 115.05, 
116.68, 114.84, 114.56, 116.58, 117.77, 119.19, 119.15, 103.41, 
76.66, 75.21, 90.32, 91.72, 97.53, 105.21, 110.43, 109.72, 112.41, 
114.05, 115.88, 117.29, 115.05, 114.69, 116.79, 109.68), `Actividades Inmobiliarias` = c(113.31, 
113.83, 114.69, 114.97, 115.98, 116.2, 116.22, 115.64, 115.79, 
115.95, 116.24, 117.6, 117.84, 115.35, 108.98, 105.89, 103.74, 
103.16, 102.5, 102.42, 102.41, 104.16, 107.74, 112.87, 116.57, 
115.68, 113.47, 112.41, 112.08, 112.42, 112.74, 113.21, 112.56
), `Actividades Profesionales, Científicas, Técnicas, Administrativas, de Apoyo y Otros Servicios` = c(111.84, 
111.92, 116.44, 117.77, 112.96, 114.64, 113.67, 112.33, 115.12, 
113.31, 114.14, 115.46, 117.17, 120.57, 124.26, 122.68, 99.51, 
86.36, 79.21, 81.56, 83.6, 88.71, 97.76, 98.16, 101.04, 102.68, 
108.37, 113.64, 114.82, 115.91, 118.35, 118.74, 109.14), empleo = c(851413, 
856079, 853309, 854541, 856040, 853881, 853328, 858454, 860200, 
861430, 865033, 867569, 874276, 870793, 872645, 876928, 873733, 
840029, 813159, 805474, 808920, 814118, 824284, 833293, 841311, 
842072, 848832, 854290, 859130, 860833, 865704, 873081, 881033
)), row.names = c(NA, -33L), class = c("tbl_df", "tbl", "data.frame"
))

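Here capital is the dependent variable and the remaining columns are the independent variables, all numeric.

So far I have tried the following function on a single data frame: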

library(car)

vif_fun <- function(df){
             while(TRUE) {
                vifs <- vif(lm(capital ~. , data = df))
                if (max(vifs) < 10) {
                     break
                }
               highest <- c(names((which(vifs == max(vifs)))))
               df <- df[,-which(names(df) %in% highest)]

              }
            return(df)
              }

vif_fun(df_nn)

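As long as any variable has a VIF greater than 10, the function should fit the linear model and compute the VIFs, identify the variable with the highest VIF, and remove that column from the data frame.

However, whenever I run the function, I get the following error message: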

Error in terms.formula(formula, data = data) : 
'.' in formula and no 'data' argument

I tried the function with the mtcars dataset, replacing capital with mpg in the function, and it worked. Any ideas about what might be happening?

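The problem is that your data.frame has non-standard names (some columns contain spaces). That causes trouble because the names on the object returned by vif() no longer match the column names exactly: vif() wraps non-standard column names in backticks, but those backticks are not part of the column names in the data.frame. The %in% match therefore finds nothing, which() returns an empty vector, and negative indexing with an empty vector drops every column, so the next lm() call sees a data frame with no variables and raises the error above. You can strip the backticks before matching, for example: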

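vif_fun <- function(df){
  # vif() backticks non-standard names; strip a leading/trailing backtick
  untick <- function(x) gsub("^`|`$", "", x)
  while(TRUE) {
    vifs <- vif(lm(capital ~. , data = df))
    if (max(vifs) < 10) {
      break
    }
    # name(s) of the predictor(s) with the largest VIF, without backticks
    highest <- untick(names((which(vifs == max(vifs)))))
    df <- df[,-which(names(df) %in% highest)]
    
  }
  return(df)
}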
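
The mismatch can be checked without car at all, since the backticks already appear in the model's term labels (which is presumably where vif() takes its names from; that part is an assumption). The sketch below uses a made-up data frame named toy with one non-standard column name. It shows that the raw label fails to match, that the untick() helper restores the match, and that negative indexing with an empty which() result leaves no columns.

```r
# Toy data (hypothetical) with one non-standard column name
toy <- data.frame(y = c(1, 3, 2, 5, 4),
                  `a b` = c(2, 1, 4, 3, 6),
                  x = c(5, 3, 4, 1, 2),
                  check.names = FALSE)

fit <- lm(y ~ ., data = toy)

# Term labels as the model reports them: "a b" comes back wrapped in backticks
term_labels <- attr(terms(fit), "term.labels")

# Same helper as in the answer: strip a leading/trailing backtick
untick <- function(x) gsub("^`|`$", "", x)

matched_raw   <- term_labels %in% names(toy)          # backticked label fails
matched_clean <- untick(term_labels) %in% names(toy)  # all labels match

# No match means which() is empty, and toy[, -integer(0)] keeps no columns
idx    <- which(names(toy) %in% term_labels[1])
n_left <- ncol(toy[, -idx])

print(term_labels)
print(matched_raw)
print(matched_clean)
print(n_left)
```

The zero-column result is what the second pass of the while loop hands to lm(), which is consistent with the error message in the question.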

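A simpler option is to use clean_names from janitor, which replaces the non-standard column names with syntactic ones, so the names returned by vif() match the columns again. Note that the returned data frame keeps the cleaned names: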

vif_fun <- function(df){
             df <- janitor::clean_names(df)
             while(TRUE) {
                vifs <- vif(lm(capital ~. , data = df))
                if (max(vifs) < 10) {
                     break
                }
               highest <- c(names((which(vifs == max(vifs)))))
               df <- df[,-which(names(df) %in% highest)]

              }
            return(df)
              }

vif_fun(df_nn)
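
Since all predictors here are numeric, another way to avoid the name problem is to never pass column names through a formula. The following is only a sketch under that assumption, not a replacement for car::vif(): it computes each VIF in base R as 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others, which should agree with vif() for models made up of single numeric terms. The names vif_base and drop_high_vif and the toy data are hypothetical. It also stops when fewer than two predictors remain and uses which.max() so that ties remove only one column per pass.

```r
# Hypothetical helper: VIF of each predictor as 1 / (1 - R^2), where R^2 is
# from regressing that predictor on all other predictors (numeric columns only)
vif_base <- function(df, response) {
  preds <- setdiff(names(df), response)
  sapply(preds, function(p) {
    x <- cbind(1, as.matrix(df[setdiff(preds, p)]))  # intercept + others
    y <- df[[p]]
    rss <- sum(lm.fit(x, y)$residuals^2)
    tss <- sum((y - mean(y))^2)
    1 / (1 - (1 - rss / tss))
  })
}

# Hypothetical helper: drop the highest-VIF column until all are below threshold
drop_high_vif <- function(df, response = "capital", threshold = 10) {
  repeat {
    preds <- setdiff(names(df), response)
    if (length(preds) < 2) break            # VIF needs at least two predictors
    vifs <- vif_base(df, response)
    if (max(vifs) < threshold) break
    worst <- names(vifs)[which.max(vifs)]   # one name only, even with ties
    df <- df[, names(df) != worst, drop = FALSE]
  }
  df
}

# Toy data (made up): "x 1" and "x 2" are nearly collinear, x3 is not
x1 <- c(1, 2, 3, 4, 5, 6, 7, 8)
toy <- data.frame(capital = c(2, 4, 5, 4, 5, 7, 8, 9),
                  `x 1` = x1,
                  `x 2` = x1 + c(0.01, -0.01, 0.02, -0.02, 0.01, -0.01, 0.02, -0.02),
                  x3 = c(3, 1, 4, 1, 5, 9, 2, 6),
                  check.names = FALSE)

result <- drop_high_vif(toy)
print(names(result))

# For a series of data frames, apply the same function to each element
results <- lapply(list(first = toy, second = toy), drop_high_vif)
```

Because the columns are matched by their actual names, spaces and other non-standard characters need no special handling, and the original column names are preserved in the result.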