OLSRR package trouble with many variables in R
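I am trying to analyze all the variables in my dataset to see which set of variables best describes my dependent variable, StockPrice.
This is the code I use to do that: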
install.packages("olsrr")
library(olsrr)
model <- lm(StockPrice ~ ESGscore + MarketValue + ibc + ni + CommonEquity + AssetsTotal + ROA + ROE + MarketToBook + TobinQ + Liabilities + stock_ret_yr_0 + stock_ret_yr_minus1 + stock_ret_yr_plus1 + EPS + BookValuePS, data = Datensatz_Excel)
ols_step_best_subset(model)
A <- ols_step_best_subset(model)
plot(A)
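Here is some data to reproduce it:
Please let me know whether this works for you; this is my first time doing this. If there is a better way to provide data than dput() (for example, something more clearly arranged), please let me know! :)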
structure(list(Company = c("AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC",
"AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC",
"AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC",
"AIR PRODUCTS & CHEMICALS INC"), Year = c(2011, 2012, 2013, 2014,
2015, 2016, 2017), gvkey = c(1209, 1209, 1209, 1209, 1209, 1209,
1209), ggroup = c(1510, 1510, 1510, 1510, 1510, 1510, 1510),
ESGscore = c(84.2750015258789, 81.9225006103516, 77.4024963378906,
80.1125030517578, 78.6449966430664, 76.3775024414062, 79.2699966430664
), MarketValue = c(17934.369140625, 17537.578125, 23639.79296875,
30868.392578125, 28037.404296875, 31271.359375, 35903.4921875
), ibc = c(1252.59997558594, 1025.19995117188, 1042.5, 988.5,
1317.59997558594, 1545.69995117188, 1155.19995117188), ni = c(1224.19995117188,
1167.30004882812, 994.200012207031, 991.700012207031, 1277.90002441406,
631.099975585938, 3000.39990234375), CommonEquity = c(5795.7998046875,
6477.2001953125, 7042.10009765625, 7365.7998046875, 7249,
7079.60009765625, 10086.2001953125), AssetsTotal = c(14290.7001953125,
16941.80078125, 17850.099609375, 17779.099609375, 17438.099609375,
18055.30078125, 18467.19921875), ROA = c(0.0906418636441231,
0.0816824957728386, 0.0586832538247108, 0.0555571131408215,
0.0718765333294868, 0.0361908674240112, 0.166178345680237
), ROE = c(0.220699846744537, 0.201404482126236, 0.153492242097855,
0.140824466943741, 0.17349100112915, 0.0870602801442146,
0.423809230327606), MarketToBook = c(3.09437346458435, 2.70758628845215,
3.35692381858826, 4.19077253341675, 3.86776161193848, 4.41710805892944,
3.55966472625732), TobinQ = c(1.84940338134766, 1.65284550189972,
1.92983758449554, 2.32192254066467, 2.19212555885315, 2.33987021446228,
2.39800786972046), Liabilities = c(8494.900390625, 10464.6005859375,
10807.9995117188, 10413.2998046875, 10189.099609375, 10975.7006835938,
8380.9990234375), StockPrice = c(85.19, 84.02, 111.78, 144.23,
130.11, 143.82, 164.08), stock_ret_yr_0 = c(-0.0378783643245697,
0.0164456591010094, 0.369286864995956, 0.321167588233948,
-0.076192781329155, 0.252576589584351, 0.170138001441956),
stock_ret_yr_minus1 = c(0.150884702801704, -0.0378783643245697,
0.0164456591010094, 0.369286864995956, 0.321167588233948,
-0.076192781329155, 0.252576589584351), stock_ret_yr_plus1 = c(0.0164456591010094,
0.369286864995956, 0.321167588233948, -0.076192781329155,
0.252576589584351, 0.170138001441956, 0.00247942004352808
), EPS = c(5.75, 5.53, 4.74, 4.66, 5.95, 2.92, 13.76), BookValuePS = c(27.21,
30.67, 33.58, 34.63, 33.73, 32.72, 46.27)), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
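The problem is that whenever R has to analyze all 16 variables, the program does not work. R echoes the code in the lower pane and "model" appears in the data pane, but nothing happens after that. There is no error message or anything similar. I also tried waiting 15 minutes, but nothing happened.
If I analyze only 4-5 variables, there is no problem at all.
Has anyone had the same problem and maybe found a solution? :)
Happy New Year to everyone, and thanks for your help :)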
BLUF, or the more hip TL;DR
I think the function ols_step_best_subset has limitations. However, there are other ways to get what you are looking for.
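To give a feel for why the number of variables matters so much, here is a rough count of candidate models. This assumes that ols_step_best_subset() compares every possible combination of predictors, which is my reading of how best-subset selection works and not something I verified in the package source.

```r
# Number of predictors in the question's model formula
p <- 16

# Candidate models of each size k = 1..p, and the total across all sizes
models_per_size <- choose(p, 1:p)
total_models <- sum(models_per_size)   # same as 2^p - 1

total_models
# 65535

# The same count for a 5-predictor model, which the question reports runs fine
small_total <- sum(choose(5, 1:5))
small_total
# 31
```

Each extra predictor roughly doubles the work, so going from 5 to 16 predictors is a jump from a few dozen fits to tens of thousands.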
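Long version:
OK, I used the data you provided, but I did not run into the problem you ran into. I think that is probably because of how few rows of data you provided. (You provided a lot of information! Just not much information that the model could use.)
Rather than ask you for more data, since this seems to be more a question about the limitations of R, I found a built-in wide dataset. I still did not run into the problem you ran into.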
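A small sketch of what "not much information that the model could use" means in practice. It uses simulated data (not your dataset) with the same shape as the sample: 7 rows and 16 predictors. The names sim, n_estimated and n_dropped are made up for this illustration.

```r
set.seed(1)

n <- 7    # rows, as in the dput() sample
p <- 16   # predictors, as in the model formula

# Simulated stand-in data: one outcome column y and p numeric predictors
sim <- as.data.frame(matrix(rnorm(n * (p + 1)), nrow = n))
names(sim) <- c("y", paste0("x", 1:p))

fit <- lm(y ~ ., data = sim)

# With only 7 rows, lm() can estimate at most 7 coefficients
# (intercept included); the remaining ones come back as NA
n_estimated <- sum(!is.na(coef(fit)))
n_dropped   <- sum(is.na(coef(fit)))
c(estimated = n_estimated, dropped = n_dropped)
```

So with a 7-row sample most of the 16 predictors cannot be estimated at all, which may be why the sample behaves differently from the full dataset.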
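I used the dhfr data from the caret package. It has over 200 variables; I arbitrarily chose 24 of them (columns 2 through 25).
I did not clean it. I did check for overly influential variables, which could have been a problem. I did this by looking for multicollinearity, which is a really big problem for linear regression. Since that was not an issue, I used this data.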
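For reference, the variance inflation factor behind a multicollinearity check can be computed by hand: regress each predictor on all the other predictors and take 1 / (1 - R^2). The sketch below does this in base R on the built-in mtcars data, not on dhfr; vif_manual is a hypothetical helper name and it assumes all predictors are numeric.

```r
# Hand-rolled VIF: for each predictor, regress it on the other predictors
# and convert that regression's R^2 into 1 / (1 - R^2)
vif_manual <- function(data, outcome) {
  predictors <- setdiff(names(data), outcome)
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in example: three predictors of mpg from mtcars
d <- mtcars[, c("mpg", "disp", "hp", "wt")]
vifs <- vif_manual(d, "mpg")
vifs
```

A value of 1 means a predictor is unrelated to the others; the larger the value, the more of its information is already carried by the other predictors.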
library(tidyverse)
library(olsrr)
library(caret)
library(randomForest)
library(car)
#------------------- Collect and Clean -------------------
data(dhfr, package = "caret")
# arbitrarily chose columns
dhfr2 <- dhfr[, 2:25]
# just checking what's there to work with
summary(dhfr2)
# check for multicollinearity, overly influential
vif(lm(moeGao_Abra_L~., data = dhfr2))
# if error, multicollinearity exists
# (multiple variables with the same information)
# high values are likely a source of a big problem in lm()
# there are several with very high values here
# data isn't clean or explored for actual analysis,
# but good enough to answer your inquiry
#----------------- Prepare for Modeling ------------------
set.seed(3926)
# partition for training and testing
tr <- createDataPartition(dhfr2[, 1], p = .7, list = FALSE)
#---------------- Linear Regression Model ----------------
# use all remaining variables in dataset (23 predictors)
fit.lm <- lm(moeGao_Abra_L~., data = dhfr2[tr, ])
summary(fit.lm)
The model's explained variance (R2) is .9629.
p <- predict(fit.lm, dhfr2[-tr, -1])
postResample(p, dhfr2[-tr, 1])
# RMSE Rsquared MAE
# 0.2416366 0.9416441 0.1744155
# potentially an issue with overfitting
# assumptions not assessed
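If the three numbers from postResample() look opaque, they can be reproduced by hand. As far as I know, for a numeric outcome they are the root mean squared error, the squared correlation between predictions and observations, and the mean absolute error. The sketch below uses tiny made-up vectors; holdout_metrics is a hypothetical helper name.

```r
# Hold-out metrics computed by hand
holdout_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),   # root mean squared error
    Rsquared = cor(pred, obs)^2,             # squared correlation
    MAE      = mean(abs(pred - obs)))        # mean absolute error
}

# Made-up observations and predictions, only for illustration
obs  <- c(1, 2, 3, 4)
pred <- c(1.1, 1.9, 3.2, 3.8)

m <- holdout_metrics(pred, obs)
m
```

Comparing the hold-out R2 with the R2 from summary(fit.lm) is a quick way to spot overfitting: a large drop on the hold-out rows is a warning sign.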
If you want to evaluate a model this large, then instead of using ols_step_best_subset() you can use rfe().
You have to create the lm model through caret.
set.seed(3926)
fit.lmT <- train(moeGao_Abra_L~., data = dhfr2[tr, ],
method = "lm")
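You have to set up a control object first, but that really just depends on the type of model you are using. For lm, all you really need to know is lmFuncs, the set of helper functions for lm.
This will use cross-validation.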
ctrl <- rfeControl(functions = lmFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)
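In case cross-validation is new to you, here is the basic idea as a base R sketch: split the rows into k folds, fit on k - 1 of them, and score on the one left out. This is a single pass of 10-fold cross-validation on the built-in mtcars data with an arbitrary two-predictor model, not what caret runs internally; "repeatedcv" repeats this kind of pass with fresh random splits.

```r
set.seed(3926)

k <- 10
# Randomly assign each row of mtcars to one of k folds
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# For each fold: fit on the other folds, then compute RMSE on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  train_rows <- mtcars[fold_id != i, ]
  test_rows  <- mtcars[fold_id == i, ]
  fit <- lm(mpg ~ wt + hp, data = train_rows)
  sqrt(mean((predict(fit, test_rows) - test_rows$mpg)^2))
})

# Average hold-out error across the folds
mean(fold_rmse)
```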
Then you can apply rfe().
lmP <- rfe(x = dhfr2[tr, -1],
y = dhfr2[tr, 1],
sizes = 4:10,
rfeControl = ctrl)
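In the call to rfe(), the sizes argument is important. From the function's help: "a numeric vector of integers corresponding to the number of features that should be retained." This looks for the group of four, the group of five, and so on up to the group of ten predictors that best fits the outcome variable. There is a lot more that you can control here.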
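To show the idea of shrinking a model down to a given size, here is a deliberately simplified base R sketch of backward elimination on the built-in mtcars data. It is not caret's implementation: it does no resampling, it re-ranks the predictors after every removal, and it uses the absolute t statistic as the importance measure, which is my assumption for illustration. eliminate_to is a hypothetical helper name, and it assumes numeric predictors and a full-rank fit.

```r
# Drop the weakest predictor (smallest |t|) one at a time
# until only `size` predictors remain
eliminate_to <- function(data, outcome, size) {
  keep <- setdiff(names(data), outcome)
  while (length(keep) > size) {
    fit   <- lm(reformulate(keep, response = outcome), data = data)
    tvals <- abs(coef(summary(fit))[keep, "t value"])
    keep  <- setdiff(keep, names(which.min(tvals)))
  }
  keep
}

# One retained set per requested size, similar in spirit to sizes = 4:10
subsets <- lapply(4:6, function(s) eliminate_to(mtcars, "mpg", s))
names(subsets) <- paste0("size_", 4:6)
subsets
```

Because predictors are removed one at a time, the number of model fits grows roughly linearly with the number of predictors instead of doubling with each one, which is why this kind of search stays fast on wide data.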
You can read all the details in the caret documentation.
The results from rfe():
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
#
# Resampling performance over subset size:
#
#  Variables   RMSE Rsquared    MAE  RMSESD RsquaredSD   MAESD Selected
#          4 0.2814   0.9062 0.2243 0.05158    0.04136 0.04234
#          5 0.2788   0.9079 0.2192 0.04495    0.03832 0.03776
#          6 0.2743   0.9106 0.2149 0.04504    0.03794 0.03795
#          7 0.2670   0.9137 0.2093 0.04296    0.03875 0.03498
#          8 0.2542   0.9205 0.2008 0.04442    0.03920 0.03437
#          9 0.2335   0.9331 0.1851 0.04086    0.03124 0.03231
#         10 0.2203   0.9402 0.1754 0.03177    0.02793 0.02591
#         23 0.2034   0.9489 0.1618 0.03034    0.02410 0.02428        *
#
# The top 5 variables (out of 23):
# moeGao_Abra_R, moe2D_GCUT_PEOE_3, moeGao_Abra_acidity,
# moe2D_BCUT_SMR_0, moe2D_BCUT_SMR_3