OLSRR package trouble with many variables in R

I am trying to analyze all the variables in my data set to see which subset of them best explains my dependent variable, StockPrice. This is the code I am using:

 install.packages("olsrr")
 library(olsrr)

model <- lm(StockPrice ~ ESGscore + MarketValue + ibc + ni + CommonEquity + AssetsTotal + ROA + ROE + MarketToBook + TobinQ + Liabilities + stock_ret_yr_0 + stock_ret_yr_minus1 + stock_ret_yr_plus1 + EPS + BookValuePS, data = Datensatz_Excel)
ols_step_best_subset(model)
A <- ols_step_best_subset(model)
plot(A)

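Here is some data to reproduce it:

Please let me know if this works for you; this is my first time doing this. If there is a better way to provide data than dput() (something laid out more clearly, for example), please tell me! :)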

structure(list(Company = c("AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC", 
"AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC", 
"AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC", 
"AIR PRODUCTS & CHEMICALS INC"), Year = c(2011, 2012, 2013, 2014, 
2015, 2016, 2017), gvkey = c(1209, 1209, 1209, 1209, 1209, 1209, 
1209), ggroup = c(1510, 1510, 1510, 1510, 1510, 1510, 1510), 
    ESGscore = c(84.2750015258789, 81.9225006103516, 77.4024963378906, 
    80.1125030517578, 78.6449966430664, 76.3775024414062, 79.2699966430664
    ), MarketValue = c(17934.369140625, 17537.578125, 23639.79296875, 
    30868.392578125, 28037.404296875, 31271.359375, 35903.4921875
    ), ibc = c(1252.59997558594, 1025.19995117188, 1042.5, 988.5, 
    1317.59997558594, 1545.69995117188, 1155.19995117188), ni = c(1224.19995117188, 
    1167.30004882812, 994.200012207031, 991.700012207031, 1277.90002441406, 
    631.099975585938, 3000.39990234375), CommonEquity = c(5795.7998046875, 
    6477.2001953125, 7042.10009765625, 7365.7998046875, 7249, 
    7079.60009765625, 10086.2001953125), AssetsTotal = c(14290.7001953125, 
    16941.80078125, 17850.099609375, 17779.099609375, 17438.099609375, 
    18055.30078125, 18467.19921875), ROA = c(0.0906418636441231, 
    0.0816824957728386, 0.0586832538247108, 0.0555571131408215, 
    0.0718765333294868, 0.0361908674240112, 0.166178345680237
    ), ROE = c(0.220699846744537, 0.201404482126236, 0.153492242097855, 
    0.140824466943741, 0.17349100112915, 0.0870602801442146, 
    0.423809230327606), MarketToBook = c(3.09437346458435, 2.70758628845215, 
    3.35692381858826, 4.19077253341675, 3.86776161193848, 4.41710805892944, 
    3.55966472625732), TobinQ = c(1.84940338134766, 1.65284550189972, 
    1.92983758449554, 2.32192254066467, 2.19212555885315, 2.33987021446228, 
    2.39800786972046), Liabilities = c(8494.900390625, 10464.6005859375, 
    10807.9995117188, 10413.2998046875, 10189.099609375, 10975.7006835938, 
    8380.9990234375), StockPrice = c(85.19, 84.02, 111.78, 144.23, 
    130.11, 143.82, 164.08), stock_ret_yr_0 = c(-0.0378783643245697, 
    0.0164456591010094, 0.369286864995956, 0.321167588233948, 
    -0.076192781329155, 0.252576589584351, 0.170138001441956), 
    stock_ret_yr_minus1 = c(0.150884702801704, -0.0378783643245697, 
    0.0164456591010094, 0.369286864995956, 0.321167588233948, 
    -0.076192781329155, 0.252576589584351), stock_ret_yr_plus1 = c(0.0164456591010094, 
    0.369286864995956, 0.321167588233948, -0.076192781329155, 
    0.252576589584351, 0.170138001441956, 0.00247942004352808
    ), EPS = c(5.75, 5.53, 4.74, 4.66, 5.95, 2.92, 13.76), BookValuePS = c(27.21, 
    30.67, 33.58, 34.63, 33.73, 32.72, 46.27)), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

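The problem is that whenever R has to analyze all 16 variables, the code never finishes. R echoes the code in the console and adds "model" to the environment, but after that nothing happens. There is no error message or anything like it. I also tried waiting 15 minutes, and still nothing happened.

If I only analyze 4-5 variables, there is no problem at all.

Has anyone run into the same problem and perhaps found a solution? :)

Happy New Year everyone, and thanks for your help :)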

BLUF, or the trendier TL;DR: I think the function ols_step_best_subset has limitations. However, there are other ways to get what you are looking for.
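A rough way to see why the number of variables matters so much: an exhaustive best-subset search has to consider every non-empty combination of predictors, which is 2^p - 1 models for p predictors. The sketch below only does that counting in base R. It assumes ols_step_best_subset() searches exhaustively, which I have not verified against the olsrr source, and n_subsets is a helper name made up for this illustration.

```r
# Number of candidate models an exhaustive best-subset search must consider
# for p predictors: every non-empty subset, i.e. 2^p - 1.
# (n_subsets is a hypothetical helper, not part of olsrr.)
n_subsets <- function(p) sum(choose(p, seq_len(p)))

n_subsets(5)   # 31
n_subsets(16)  # 65535
```

Going from 5 to 16 predictors multiplies the number of candidate models by more than two thousand, so a run that finishes quickly with 4-5 variables can appear to hang with 16.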

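Long version:

OK, I used the data you provided, but I didn't run into any of the problems you ran into. I think that is probably because the data you provided has so few rows. (You provided plenty of information! It just isn't much for a model to work with.)

I didn't ask you for more data, because this seems to be more of a question about R's limitations, so I found a built-in wide dataset instead. I still didn't run into the problem you ran into.

I used the dhfr data from the caret package. It has over 200 variables; I arbitrarily picked 24 of them (the outcome plus 23 predictors).

I didn't clean it. I did look at influential variables, which could turn out to be a problem. I did this to check for multicollinearity, which is a very big problem for linear regression. Since that isn't what your question is about, I used the data anyway.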
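For readers unfamiliar with the multicollinearity check: the variance inflation factor for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The following is a minimal base-R sketch of that formula, not the implementation in car::vif(). It uses the built-in mtcars data as a stand-in, and vif_manual is a helper name made up for this illustration.

```r
# Hand-computed VIF for one predictor.
# mtcars is a stand-in dataset here, not the dhfr data used in this answer;
# vif_manual is a hypothetical helper, not a function from car or olsrr.
vif_manual <- function(predictor, data, outcome) {
  others <- setdiff(names(data), c(predictor, outcome))
  # Regress the chosen predictor on all the other predictors
  aux <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(aux)$r.squared)
}

cars <- mtcars[, c("mpg", "disp", "hp", "wt")]
vifs <- sapply(c("disp", "hp", "wt"), vif_manual, data = cars, outcome = "mpg")
vifs
```

A value near 1 means the predictor shares little information with the others; the larger the value, the more its coefficient estimate is inflated by overlap with the other predictors.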

library(tidyverse)
library(olsrr)
library(caret) 
library(randomForest)
library(car)

#------------------- Collect and Clean -------------------
data(dhfr, package = "caret")

# arbitrarily chose columns
dhfr2 <- dhfr[, 2:25]

# just checking what's there to work with
summary(dhfr2)

# check for multicollinearity, overly influential
vif(lm(moeGao_Abra_L~., data = dhfr2))
# if error, multicollinearity exists 
#    (multiple variables with the same information)
# high values are likely a source of a big problem in lm()
  # there are several with very high values here

# data isn't clean or explored for actual analysis, 
# but good enough to answer your inquiry

#----------------- Prepare for Modeling ------------------
set.seed(3926)
# partition for training and testing
tr <- createDataPartition(dhfr2[, 1], p = .7, list = FALSE)

#---------------- Linear Regression Model ----------------
# use all remaining variables in dataset (23 predictors)
fit.lm <- lm(moeGao_Abra_L~., data = dhfr2[tr, ])
summary(fit.lm)

The model's explained variance (R²) on the training data is .9629.

p <- predict(fit.lm, dhfr2[-tr, -1])
postResample(p, dhfr2[-tr, 1])
#      RMSE  Rsquared       MAE 
# 0.2416366 0.9416441 0.1744155  
# potentially an issue with overfitting
   # assumptions not assessed 

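If you want to evaluate a model this large without using ols_step_best_subset(), you can use rfe() (recursive feature elimination) from caret.

You have to create the lm model through caret.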

set.seed(3926)
fit.lmT <- train(moeGao_Abra_L~., data = dhfr2[tr, ],
                 method = "lm")

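You have to set up a controller first, but what goes in it really just depends on the type of model you are using. For lm, all you really need to know about is lmFuncs.

This will use cross-validation.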

ctrl <- rfeControl(functions = lmFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)

Then you can apply rfe().

lmP <- rfe(x = dhfr2[tr, -1],
           y = dhfr2[tr, 1],
           sizes = 4:10,  
           rfeControl = ctrl)

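In the call to rfe(), the sizes argument is important. From the function's help: "a numeric vector of integers corresponding to the number of features that should be retained." Here it looks for the set of four, the set of five, and so on up to the set of ten predictors that best fits the outcome variable. The full set of 23 predictors is always evaluated as well, which is why it appears in the results. There is more you can control here, too.

You can read all the details in the caret documentation.

Results from rfe() (the asterisk marks the selected size; in this run the full set of 23 had the lowest RMSE):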

# Recursive feature selection
# 
# Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
# 
# Resampling performance over subset size:
# 
#  Variables   RMSE Rsquared    MAE  RMSESD RsquaredSD   MAESD Selected
#          4 0.2814   0.9062 0.2243 0.05158    0.04136 0.04234         
#          5 0.2788   0.9079 0.2192 0.04495    0.03832 0.03776         
#          6 0.2743   0.9106 0.2149 0.04504    0.03794 0.03795         
#          7 0.2670   0.9137 0.2093 0.04296    0.03875 0.03498         
#          8 0.2542   0.9205 0.2008 0.04442    0.03920 0.03437         
#          9 0.2335   0.9331 0.1851 0.04086    0.03124 0.03231         
#         10 0.2203   0.9402 0.1754 0.03177    0.02793 0.02591         
#         23 0.2034   0.9489 0.1618 0.03034    0.02410 0.02428        *
# 
# The top 5 variables (out of 23):
#    moeGao_Abra_R, moe2D_GCUT_PEOE_3, moeGao_Abra_acidity, 
#    moe2D_BCUT_SMR_0, moe2D_BCUT_SMR_3
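If you want to pull the selected variables out of the fitted object rather than read them off the printout, the sketch below shows one way. To keep it small and quick to run, it refits rfe() on the built-in mtcars data with plain 5-fold cross-validation; that substitution, the sizes 2:4, and the names x, y, ctrl_small and fit_small are my own choices for illustration, not part of the analysis above.

```r
library(caret)

# mtcars stands in for the real data: mpg is the outcome,
# the remaining 10 columns are the candidate predictors.
x <- mtcars[, -1]
y <- mtcars$mpg

set.seed(3926)
ctrl_small <- rfeControl(functions = lmFuncs,
                         method = "cv",
                         number = 5,
                         verbose = FALSE)
fit_small <- rfe(x = x, y = y, sizes = 2:4, rfeControl = ctrl_small)

fit_small$optsize      # subset size that rfe() selected
predictors(fit_small)  # names of the predictors in that subset
fit_small$results      # the resampling table shown when the object is printed
```

The same three accessors should work on the lmP object from the dhfr example.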