R Quantreg:具有分类调查数据的奇点
R Quantreg: Singularity with categorical survey data
对于我的学士论文,我正在尝试对来自调查 (see formula from A.Blass (2008)) 的常数和数据应用线性中值回归模型。它试图重新创建 A. Blass 等人 (2008) 提出的概率启发方法 - 使用启发选择概率估计随机效用模型:电力可靠性偏好
我的因变量是 常数和分配 的对数比 t运行 形成。使用以下公式计算:
PE_raw <- PE_raw %>% group_by(sys_RespNum, Task) %>% mutate(LogProb = c(log(Response[1]/Response[1]),
log(Response[2]/Response[1]),
log(Response[3]/Response[1])))
我的自变量是运费、最小订货量和运费window,每个分类变量的级别为 0、1、2 和 3。这里,级别 0 表示 none-选项。
Data snapshot
我尝试了运行以下分位数回归(使用 R 的 quantreg 包):
LAD.factor <- rq(LogProb ~ factor(`Delivery costs`) + factor(`Minimum order quantity`) + factor(`Delivery window`) + factor(NoneOpt), data=PE_raw, tau=0.5)
然而,我运行进入以下指示奇点的错误:
Error in rq.fit.br(x, y, tau = tau, ...) : Singular design matrix
I 运行 线性回归并应用 R 的别名函数进行进一步调查。这告诉我三种完全多重共线性的情况:
- 最小起订量3=运费1+运费2+运费3-起订量1-起订量2
- 运费window3=运费1+运费2+运费3-运费window1-运费window2
- NoneOpt = 拦截 - 运费 1 - 运费 2 - 运费 3
事后看来,这些案例都是有道理的。当 R 对分类变量进行二分法时,您可以通过构建得到这些结果,运费 1 + 运费 2 + 运费 3 = 1 和最小订货量 1 + 最小订货量 2 + 最小订货量 3 = 1。重写给出第一个公式。
它看起来像一个经典的虚拟陷阱。为了解决这个问题,我尝试手动对数据进行二分法并使用以下公式:
LM.factor <- rq(LogProb ~ Delivery.costs_1 + Delivery.costs_2 + Minimum.order.quantity_1 + Minimum.order.quantity_2 + Delivery.window_1 + Delivery.window_2 + factor(NoneOpt), data=PE_dichomitzed, tau=0.5)
我现在得到的不是错误消息:
Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique
使用汇总功能时:
> summary(LM.factor)
Error in base::backsolve(r, x, k = k, upper.tri = upper.tri, transpose = transpose, :
singular matrix in 'backsolve'. First zero in diagonal [2]
In addition: Warning message:
In summary.rq(LM.factor) : 153 non-positive fis
有人熟悉这个问题吗?我正在寻找替代解决方案。也许我在使用 rq() 函数时犯了错误,或者数据可能被歪曲了。
非常感谢您的任何意见,在此先感谢您。
可重现的例子
library(quantreg)
#### Raw dataset (PE_raw_SO) ####
# quantile regression (produces singularity error)
LAD.factor <- rq(
LogProb ~ factor(`Delivery costs`) +
factor(`Minimum order quantity`) + factor(`Delivery window`) +
factor(NoneOpt),
data = PE_raw_SO,
tau = 0.5
)
# linear regression to check for singularity
LM.factor <- lm(
LogProb ~ factor(`Delivery costs`) +
factor(`Minimum order quantity`) + factor(`Delivery window`) +
factor(NoneOpt),
data = PE_raw_SO
)
alias(LM.factor)
# impose assumptions on standard errors
summary(LM.factor, se = "iid")
summary(LM.factor, se = "boot")
#### Manually created dummy variables to get rid of
#### collinearity (PE_dichotomized_SO) ####
LAD.di.factor <- rq(
LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
Minimum.order.quantity_1 + Minimum.order.quantity_2 +
Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
data = PE_dichotomized_SO,
tau = 0.5
)
summary(LAD.di.factor) #backsolve error
# impose assumptions (unusual results)
summary(LAD.di.factor, se = "iid")
summary(LAD.di.factor, se = "boot")
# linear regression to check for singularity
LM.di.factor <- lm(
LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
Minimum.order.quantity_1 + Minimum.order.quantity_2 +
Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
data = PE_dichotomized_SO
)
alias(LM.di.factor)
summary(LM.di.factor) #regular results, all significant
Link 样本数据+代码:GitHub
使用虚拟解释变量进行分位数回归时,Solution may be nonunique
行为并不罕见。
请参见 quantreg FAQ:
The estimation of regression quantiles is a linear programming
problem. And the optimal solution may not be unique.
Roger Koenker(quantreg 的作者)在 r-help back in 2006 上对正在发生的事情给出了更直观的解释:
When computing the median from a sample with an even number of
distinct values there is inherently some ambiguity about its value:
any value between the middle order statistics is "a" median.
Similarly, in regression settings the optimization problem solved by
the "br" version of the simplex algorithm, modified to do general
quantile regression identifies cases where there may be non
uniqueness of this type. When there are "continuous" covariates this
is quite rare, when covariates are discrete then it is relatively
common, atleast when tau is chosen from the rationals. For univariate
quantiles R provides several methods of resolving this sort of
ambiguity by interpolation, "br" doesn't try to do this, instead
returning the first vertex solution that it comes to.
你的第二个警告——“153 non-positive fis”——是一个与 rq 计算局部密度的方式有关的警告。有时,分位数回归函数的局部密度可能最终为负(这显然是不可能的)。如果发生这种情况,rq 会自动将它们设置为零。再次引用 FAQ:
This is generally harmless, leading to a somewhat conservative
(larger) estimate of the standard errors, however if the reported
number of non-positive fis is large relative to the sample size then
it is an indication of misspecification of the model.
对于我的学士论文,我正在尝试对来自调查 (see formula from A.Blass (2008)) 的常数和数据应用线性中值回归模型。它试图重新创建 A. Blass 等人 (2008) 提出的概率启发方法 - 使用启发选择概率估计随机效用模型:电力可靠性偏好
我的因变量是 常数和分配 的对数比 t运行 形成。使用以下公式计算:
PE_raw <- PE_raw %>% group_by(sys_RespNum, Task) %>% mutate(LogProb = c(log(Response[1]/Response[1]),
log(Response[2]/Response[1]),
log(Response[3]/Response[1])))
我的自变量是运费、最小订货量和运费window,每个分类变量的级别为 0、1、2 和 3。这里,级别 0 表示 none-选项。
Data snapshot
我尝试了运行以下分位数回归(使用 R 的 quantreg 包):
LAD.factor <- rq(LogProb ~ factor(`Delivery costs`) + factor(`Minimum order quantity`) + factor(`Delivery window`) + factor(NoneOpt), data=PE_raw, tau=0.5)
然而,我运行进入以下指示奇点的错误:
Error in rq.fit.br(x, y, tau = tau, ...) : Singular design matrix
I 运行 线性回归并应用 R 的别名函数进行进一步调查。这告诉我三种完全多重共线性的情况:
- 最小起订量3=运费1+运费2+运费3-起订量1-起订量2
- 运费window3=运费1+运费2+运费3-运费window1-运费window2
- NoneOpt = 拦截 - 运费 1 - 运费 2 - 运费 3
事后看来,这些案例都是有道理的。当 R 对分类变量进行二分法时,您可以通过构建得到这些结果,运费 1 + 运费 2 + 运费 3 = 1 和最小订货量 1 + 最小订货量 2 + 最小订货量 3 = 1。重写给出第一个公式。
它看起来像一个经典的虚拟陷阱。为了解决这个问题,我尝试手动对数据进行二分法并使用以下公式:
LM.factor <- rq(LogProb ~ Delivery.costs_1 + Delivery.costs_2 + Minimum.order.quantity_1 + Minimum.order.quantity_2 + Delivery.window_1 + Delivery.window_2 + factor(NoneOpt), data=PE_dichomitzed, tau=0.5)
我现在得到的不是错误消息:
Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique
使用汇总功能时:
> summary(LM.factor)
Error in base::backsolve(r, x, k = k, upper.tri = upper.tri, transpose = transpose, :
singular matrix in 'backsolve'. First zero in diagonal [2]
In addition: Warning message:
In summary.rq(LM.factor) : 153 non-positive fis
有人熟悉这个问题吗?我正在寻找替代解决方案。也许我在使用 rq() 函数时犯了错误,或者数据可能被歪曲了。
非常感谢您的任何意见,在此先感谢您。
可重现的例子
library(quantreg)
#### Raw dataset (PE_raw_SO) ####
# quantile regression (produces singularity error)
LAD.factor <- rq(
LogProb ~ factor(`Delivery costs`) +
factor(`Minimum order quantity`) + factor(`Delivery window`) +
factor(NoneOpt),
data = PE_raw_SO,
tau = 0.5
)
# linear regression to check for singularity
LM.factor <- lm(
LogProb ~ factor(`Delivery costs`) +
factor(`Minimum order quantity`) + factor(`Delivery window`) +
factor(NoneOpt),
data = PE_raw_SO
)
alias(LM.factor)
# impose assumptions on standard errors
summary(LM.factor, se = "iid")
summary(LM.factor, se = "boot")
#### Manually created dummy variables to get rid of
#### collinearity (PE_dichotomized_SO) ####
LAD.di.factor <- rq(
LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
Minimum.order.quantity_1 + Minimum.order.quantity_2 +
Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
data = PE_dichotomized_SO,
tau = 0.5
)
summary(LAD.di.factor) #backsolve error
# impose assumptions (unusual results)
summary(LAD.di.factor, se = "iid")
summary(LAD.di.factor, se = "boot")
# linear regression to check for singularity
LM.di.factor <- lm(
LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
Minimum.order.quantity_1 + Minimum.order.quantity_2 +
Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
data = PE_dichotomized_SO
)
alias(LM.di.factor)
summary(LM.di.factor) #regular results, all significant
Link 样本数据+代码:GitHub
使用虚拟解释变量进行分位数回归时,Solution may be nonunique
行为并不罕见。
请参见 quantreg FAQ:
The estimation of regression quantiles is a linear programming problem. And the optimal solution may not be unique.
Roger Koenker(quantreg 的作者)在 r-help back in 2006 上对正在发生的事情给出了更直观的解释:
When computing the median from a sample with an even number of distinct values there is inherently some ambiguity about its value: any value between the middle order statistics is "a" median. Similarly, in regression settings the optimization problem solved by the "br" version of the simplex algorithm, modified to do general quantile regression identifies cases where there may be non uniqueness of this type. When there are "continuous" covariates this is quite rare, when covariates are discrete then it is relatively common, atleast when tau is chosen from the rationals. For univariate quantiles R provides several methods of resolving this sort of ambiguity by interpolation, "br" doesn't try to do this, instead returning the first vertex solution that it comes to.
你的第二个警告——“153 non-positive fis”——是一个与 rq 计算局部密度的方式有关的警告。有时,分位数回归函数的局部密度可能最终为负(这显然是不可能的)。如果发生这种情况,rq 会自动将它们设置为零。再次引用 FAQ:
This is generally harmless, leading to a somewhat conservative (larger) estimate of the standard errors, however if the reported number of non-positive fis is large relative to the sample size then it is an indication of misspecification of the model.