R Quantreg: Singularity with categorical survey data

For my bachelor thesis I am trying to fit a linear median regression model to constant-sum data from a survey (see the formula from A. Blass (2008)). The survey tries to recreate the probability elicitation approach proposed in A. Blass et al. (2008), "Using elicited choice probabilities to estimate random utility models: Preferences for electricity reliability".

My dependent variable is the log-ratio transformation of the constant-sum allocations. It is calculated with the following formula:

library(dplyr)
PE_raw <- PE_raw %>%
  group_by(sys_RespNum, Task) %>%
  mutate(LogProb = c(log(Response[1] / Response[1]),  # reference alternative, always 0
                     log(Response[2] / Response[1]),
                     log(Response[3] / Response[1])))
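The same transformation can be written more compactly; the following is an equivalent sketch, assuming the reference alternative is always the first row within each (sys_RespNum, Task) group:

# Equivalent, shorter form: divide every allocation in a task by the reference allocation
PE_raw <- PE_raw %>%
  group_by(sys_RespNum, Task) %>%
  mutate(LogProb = log(Response / Response[1])) %>%
  ungroup()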

My independent variables are Delivery costs, Minimum order quantity and Delivery window, each a categorical variable with levels 0, 1, 2 and 3. Here, level 0 indicates the none-option.

Data snapshot

I tried running the following quantile regression (using R's quantreg package):

LAD.factor <- rq(LogProb ~ factor(`Delivery costs`) + factor(`Minimum order quantity`) + factor(`Delivery window`) + factor(NoneOpt), data=PE_raw, tau=0.5)

However, I ran into the following error indicating a singularity:

Error in rq.fit.br(x, y, tau = tau, ...) : Singular design matrix

I ran a linear regression and applied R's alias() function to investigate further. This showed me three cases of perfect multicollinearity:

In hindsight, these cases make sense. When R dummy-codes the categorical variables, you get them by construction, since Delivery costs_1 + Delivery costs_2 + Delivery costs_3 = 1 and Minimum order quantity_1 + Minimum order quantity_2 + Minimum order quantity_3 = 1. Rewriting gives the first formula.
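A quick way to confirm the dependency without fitting a model is to expand the factors into the design matrix yourself and compare its rank to its number of columns. A minimal sketch using the same formula terms:

# Build the full design matrix and check whether it is rank-deficient
X <- model.matrix(
  ~ factor(`Delivery costs`) + factor(`Minimum order quantity`) +
    factor(`Delivery window`) + factor(NoneOpt),
  data = PE_raw
)
qr(X)$rank  # smaller than ncol(X) when the design matrix is singular
ncol(X)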

It looks like a classic dummy variable trap. To solve it, I tried dummy-coding the data manually and used the following formula:

LM.factor <- rq(LogProb ~ Delivery.costs_1 + Delivery.costs_2 + Minimum.order.quantity_1 + Minimum.order.quantity_2 + Delivery.window_1 + Delivery.window_2 + factor(NoneOpt), data=PE_dichotomized, tau=0.5)
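One way to build such dummy columns by hand (a sketch only; the column names are chosen to match the formula above):

# Create 0/1 indicators for the attribute levels kept in the model
PE_dichotomized <- PE_raw %>%
  mutate(
    Delivery.costs_1         = as.integer(`Delivery costs` == 1),
    Delivery.costs_2         = as.integer(`Delivery costs` == 2),
    Minimum.order.quantity_1 = as.integer(`Minimum order quantity` == 1),
    Minimum.order.quantity_2 = as.integer(`Minimum order quantity` == 2),
    Delivery.window_1        = as.integer(`Delivery window` == 1),
    Delivery.window_2        = as.integer(`Delivery window` == 2)
  )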

Running this model, I now get a warning instead of an error message:

Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique

And when using the summary function:

 > summary(LM.factor)
Error in base::backsolve(r, x, k = k, upper.tri = upper.tri, transpose = transpose,  : 
  singular matrix in 'backsolve'. First zero in diagonal [2]
In addition: Warning message:
In summary.rq(LM.factor) : 153 non-positive fis

Is anyone familiar with this issue? I am looking for alternative solutions. Perhaps I am making a mistake in how I use the rq() function, or the data may be skewed.

Any input is highly appreciated. Thanks in advance!

Reproducible example

library(quantreg)

#### Raw dataset (PE_raw_SO) ####

# quantile regression (produces singularity error)
LAD.factor <- rq(
  LogProb ~ factor(`Delivery costs`) +
    factor(`Minimum order quantity`) + factor(`Delivery window`) +
    factor(NoneOpt),
  data = PE_raw_SO,
  tau = 0.5
) 

# linear regression to check for singularity
LM.factor <- lm(
  LogProb ~ factor(`Delivery costs`) +
    factor(`Minimum order quantity`) + factor(`Delivery window`) +
    factor(NoneOpt),
  data = PE_raw_SO
)
alias(LM.factor)

# impose assumptions on standard errors
summary(LM.factor, se = "iid")
summary(LM.factor, se = "boot")


#### Manually created dummy variables to get rid of
#### collinearity (PE_dichotomized_SO) ####
LAD.di.factor <- rq(
  LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
    Minimum.order.quantity_1 + Minimum.order.quantity_2 +
    Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
  data = PE_dichotomized_SO,
  tau = 0.5
)

summary(LAD.di.factor)  #backsolve error

# impose assumptions (unusual results)
summary(LAD.di.factor, se = "iid") 
summary(LAD.di.factor, se = "boot")

# linear regression to check for singularity
LM.di.factor <- lm(
  LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
    Minimum.order.quantity_1 + Minimum.order.quantity_2 +
    Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
  data = PE_dichotomized_SO
)
alias(LM.di.factor)

summary(LM.di.factor)  #regular results, all significant

Link to sample data + code: GitHub

The Solution may be nonunique behaviour is not unusual when doing quantile regressions with dummy explanatory variables.

See the quantreg FAQ:

The estimation of regression quantiles is a linear programming problem. And the optimal solution may not be unique.

Roger Koenker (the author of quantreg) gave a more intuitive explanation of what is going on in a post on r-help back in 2006:

When computing the median from a sample with an even number of distinct values there is inherently some ambiguity about its value: any value between the middle order statistics is "a" median. Similarly, in regression settings the optimization problem solved by the "br" version of the simplex algorithm, modified to do general quantile regression identifies cases where there may be non uniqueness of this type. When there are "continuous" covariates this is quite rare, when covariates are discrete then it is relatively common, atleast when tau is chosen from the rationals. For univariate quantiles R provides several methods of resolving this sort of ambiguity by interpolation, "br" doesn't try to do this, instead returning the first vertex solution that it comes to.
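The univariate median makes the point concrete; the snippet below is a toy illustration, not taken from the data in the question:

library(quantreg)

y <- c(1, 2, 3, 4)           # even number of distinct values
quantile(y, 0.5)             # 2.5: R interpolates between the middle order statistics
coef(rq(y ~ 1, tau = 0.5))   # a vertex solution; any value in [2, 3] attains the same objective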

Your second warning, "153 non-positive fis", relates to the way rq computes the local densities. Sometimes the estimated local densities of the quantile regression function turn out negative (which is obviously impossible), and when that happens rq automatically sets them to zero. Again, quoting the FAQ:

This is generally harmless, leading to a somewhat conservative (larger) estimate of the standard errors, however if the reported number of non-positive fis is large relative to the sample size then it is an indication of misspecification of the model.
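One way to sidestep the local-density estimation altogether (the reproducible example above already tries this) is to request bootstrap standard errors, which do not rely on the estimated fis. A minimal sketch, where R = 1000 is just an illustrative number of replications:

# Bootstrap standard errors avoid the fis-based sparsity estimation
summary(LAD.di.factor, se = "boot", R = 1000)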