Segmentation faults in R's Cox regression with "exact" ties

I am trying to fit a large discrete proportional-hazards model (~100k rows, ~10k events). To do this, I used coxph(..., method = "exact"), as recommended by the survival package documentation, which states:

The “exact partial likelihood” is equivalent to a conditional logistic model, and is appropriate when the times are a small set of discrete values. If there are a large number of ties and (start, stop) style survival data the computational time will be excessive.

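So there are some warnings about computational difficulty with coxph and large numbers of ties, but according to the documentation for clogit in the same package: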

The computation of the exact partial likelihood can be very slow, however. If a particular strata had, say 10 events out of 20 subjects we have to add up a denominator that involves all possible ways of choosing 10 out of 20, which is 20!/(10! 10!) = 184756 terms. Gail et al describe a fast recursion method which largely ameleorates this; it was incorporated into version 2.36-11 of the survival package.
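As a quick check of the arithmetic in that quote, the number of terms can be computed directly in R. The snippet below is only an illustration of how fast the naive count grows; the stratum sizes other than 20 are made up for the example, and the count describes brute-force enumeration, not the cost of the recursion mentioned in the quote.

```r
# Number of terms in the naive exact-likelihood denominator for a stratum
# with d events among n subjects: choose(n, d).
terms_20_10 <- choose(20, 10)
print(terms_20_10)   # 184756, the value 20!/(10! 10!) given in the quote

# For larger strata the count overflows double precision (choose(2000, 1000)
# is Inf), so compare sizes on the log10 scale instead.
n <- c(20, 200, 2000)          # hypothetical stratum sizes
d <- n / 2                     # half of the subjects have the event
growth <- data.frame(n = n, d = d, log10_terms = lchoose(n, d) / log(10))
print(growth)
```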

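So I did not expect the computational problems to be too bad. Nevertheless, when I tried to fit variants of a trivial (single-predictor) Cox model on my dataset, I ran into a number of segmentation faults. One was a "C stack overflow," which produced this short and sweet (and uninformative) message: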

Error: segfault from C stack overflow
Execution halted

Another was a "memory not mapped" error, which occurred when I accidentally flipped the "event" boolean so that I had ~90k events instead of ~10k:

 *** caught segfault ***
address 0xffffffffac577830, cause 'memory not mapped'

Traceback:
 1: fitter(X, Y, strats, offset, init, control, weights = weights,     method = method, row.names(mf))
 2: coxph(Surv(time, status == EVENT.STATUS) ~ litter, data = data,     method = "exact")
aborting ...

For reference, the code I am running is simply coxph(Surv(t, d) ~ x, data = data, method = 'exact'), where t is an integer column, d is a boolean, and x is a float.

Are these known issues? Are there workarounds?

EDIT: Here is some code that reproduces the problem on the rats dataset (replicated 1000 times):

library(survival)
print("constructing data")
data <- rats
SIZE <- nrow(rats)
# passes with 100 reps, but fails with 1000 on my machine (MacBook Pro, 16 GB RAM)
REPS <- 1000
# set to 0 for "memory not mapped", 1 for "C stack overflow"
EVENT.STATUS <- 0
data <- data[rep(seq_len(SIZE), REPS), ]
print(summary(data$status == EVENT.STATUS))
print("fitting model")
fit <- coxph(Surv(time, status == EVENT.STATUS) ~ litter,
             data = data, method = "exact")
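
To get a rough sense of how heavily tied such data are, one could tabulate the events and the risk-set size at each distinct event time. The sketch below does this on the unreplicated rats data, using the same event coding as the script above. The names ties and log10_subsets are made up for this illustration, and the subset count is only the size of a brute-force enumeration; it says nothing definite about what the fitting routine actually does internally.

```r
library(survival)

# Illustration only: unreplicated rats data, with the same event coding
# as the reproduction script (EVENT.STATUS <- 0).
EVENT.STATUS <- 0
d <- rats
d$event <- d$status == EVENT.STATUS

# Distinct times at which at least one event occurs
times <- sort(unique(d$time[d$event]))

# Number of events tied at each distinct event time
n_events <- sapply(times, function(tt) sum(d$time == tt & d$event))
# Number of subjects still at risk at each distinct event time
n_risk <- sapply(times, function(tt) sum(d$time >= tt))

# log10 of choose(n_risk, n_events): the size of a brute-force enumeration
ties <- data.frame(time = times, n_events = n_events, n_risk = n_risk,
                   log10_subsets = lchoose(n_risk, n_events) / log(10))

# Show the most heavily tied event times first
print(head(ties[order(-ties$log10_subsets), ]))
```

Replicating the data REPS times multiplies both n_events and n_risk by REPS at every event time, which is why the replicated dataset is so much more demanding than the original.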

And here is the output of version:

platform       x86_64-apple-darwin14.0.0
arch           x86_64
os             darwin14.0.0
system         x86_64, darwin14.0.0
status
major          3
minor          1.2
year           2014
month          10
day            31
svn rev        66913
language       R
version.string R version 3.1.2 (2014-10-31)
nickname       Pumpkin Helmet

I was able to fit a Poisson model to that dataset. (I have a large dataset that I am not willing to risk on a possible segfault.)

fit <- glm(I(status == 0) ~ litter + offset(log(time)),
           data = data, family = poisson)

> fit

Call:  glm(formula = I(status == 0) ~ litter + offset(log(time)), family = poisson, 
    data = data)

Coefficients:
(Intercept)       litter  
  -4.706485    -0.003883  

Degrees of Freedom: 149999 Total (i.e. Null);  149998 Residual
Null Deviance:      60500 
Residual Deviance: 60150    AIC: 280200

The effect estimate for litter should be similar to what you would get from a Cox PH model.
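
One way to check that claim without risking a crash is to fit both models on the small, unreplicated rats data. The sketch below is an assumption-laden illustration rather than part of the original analysis: it swaps in the Efron approximation for ties (method = "efron") in place of method = "exact", and the names small, fit_pois, and fit_cox are made up for the example. How closely the two estimates agree depends on how reasonable a constant baseline hazard is for the data.

```r
library(survival)

EVENT.STATUS <- 0                 # same event coding as the question's script
small <- rats                     # unreplicated data, so both fits are cheap
small$event <- as.integer(small$status == EVENT.STATUS)

# Poisson model with log(time) as an offset: a constant-hazard rate model
fit_pois <- glm(event ~ litter + offset(log(time)),
                data = small, family = poisson)

# Cox model using the Efron approximation for ties (not the exact method)
fit_cox <- coxph(Surv(time, event) ~ litter, data = small, method = "efron")

# Log rate ratio and log hazard ratio for litter, side by side
estimates <- c(poisson   = unname(coef(fit_pois)["litter"]),
               cox_efron = unname(coef(fit_cox)["litter"]))
print(estimates)
```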

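If you want to see documentation of the "offset trick", go to Breslow and Day's classic monograph: "Statistical Methods in Cancer Research; Vol II- The Design and Analysis of Cohort Studies". They used the GLIM software package, but the code is very similar to R's glm implementation, so transferring the concepts should be straightforward. (I had the chance to work briefly with Norm Breslow, using GLIM, on my Masters thesis. He was brilliant. I think it was my prior training in GLIM that made picking up R so easy.)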