Segmentation faults in R's Cox regression with "exact" ties

Question

I'm trying to fit a large discrete proportional hazards model (~100k rows, ~10k events). To do this, I used coxph(..., method = "exact") as recommended by the survival package documentation, which states:

The “exact partial likelihood” is equivalent to a conditional logistic model, and is appropriate when the times are a small set of discrete values. If there are a large number of ties and (start, stop) style survival data the computational time will be excessive.

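There are some warnings about computational difficulty with coxph and large numbers of ties, but according to the documentation for clogit in the same package: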

The computation of the exact partial likelihood can be very slow, however. If a particular strata had, say 10 events out of 20 subjects we have to add up a denominator that involves all possible ways of choosing 10 out of 20, which is 20!/(10! 10!) = 184756 terms. Gail et al describe a fast recursion method which largely ameleorates this; it was incorporated into version 2.36-11 of the survival package.
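The term count quoted above is easy to check in R, and the same calculation shows how quickly the naive denominator grows with stratum size. This is only an illustration of the arithmetic in the quoted documentation; the variable names are made up for the example.

```r
# Number of ways to choose 10 events out of 20 subjects,
# i.e. 20! / (10! 10!) as quoted from the clogit documentation.
n_terms <- choose(20, 10)
print(n_terms)  # 184756

# The same count for a stratum ten times larger (100 events out of 200 subjects)
# is astronomically large, which is why a naive sum is infeasible.
n_terms_big <- choose(200, 100)
print(n_terms_big)
```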

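So I didn't expect the computational issues to be too bad. Nevertheless, I've run into a number of segmentation faults when trying to fit a variant of a trivial (one-predictor) Cox model on my dataset. One is a "C stack overflow," which produces this short and sweet (and uninformative) message: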

Error: segfault from C stack overflow
Execution halted

Another is a "memory not mapped" error, which occurred when I accidentally flipped the "event" boolean so that I had ~90k events instead of ~10k:

 *** caught segfault ***
address 0xffffffffac577830, cause 'memory not mapped'

Traceback:
 1: fitter(X, Y, strats, offset, init, control, weights = weights,     method = method, row.names(mf))
 2: coxph(Surv(time, status == EVENT.STATUS) ~ litter, data = data,     method = "exact")
aborting ...

For reference, the code I'm running is simply coxph(Surv(t, d) ~ x, data = data, method = 'exact'). t is an integer column, d is a Boolean, and x is a float.

Are these known issues? Are there workarounds?

EDIT: Here is some code that reproduces the problem on the rats dataset (replicated 1000 times):

library(survival)
print("constructing data")
data <- rats
SIZE <- nrow(rats)
# passes with 100 reps, but fails with 1000 on my machine (MacBook Pro, 16 GB RAM)
REPS <- 1000
# set to 0 for "memory not mapped", 1 for "C stack overflow"
EVENT.STATUS <- 0
data <- data[rep(seq_len(SIZE), REPS), ]
print(summary(data$status == EVENT.STATUS))
print("fitting model")
fit <- coxph(Surv(time, status == EVENT.STATUS) ~ litter,
             data = data, method = "exact")

And here is the output of version:

platform       x86_64-apple-darwin14.0.0
arch           x86_64
os             darwin14.0.0
system         x86_64, darwin14.0.0
status
major          3
minor          1.2
year           2014
month          10
day            31
svn rev        66913
language       R
version.string R version 3.1.2 (2014-10-31)
nickname       Pumpkin Helmet

Answer 1

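I was able to fit a Poisson model using that dataset. (I have a large dataset of my own that I'm not willing to put at risk of a possible segfault.)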

fit <- glm(I(status == 0) ~ litter + offset(log(time)),
           data = data, family = poisson)

> fit

Call:  glm(formula = I(status == 0) ~ litter + offset(log(time)), family = poisson, 
    data = data)

Coefficients:
(Intercept)       litter  
  -4.706485    -0.003883  

Degrees of Freedom: 149999 Total (i.e. Null);  149998 Residual
Null Deviance:      60500 
Residual Deviance: 60150    AIC: 280200

The estimate of the litter effect should be similar to what you would get from a Cox PH model.
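One way to check that claim without touching method = "exact" is to compare the two fits on the original, unreplicated rats data. The sketch below is only an illustration: it assumes the same event coding as the glm call (status == 0), it swaps in the Efron approximation for ties rather than the exact partial likelihood, and the object names (d, cox_fit, pois_fit, estimates) are made up for the example.

```r
library(survival)

# Assumption: same event coding as the glm call in this answer (status == 0).
d <- transform(rats, event = as.integer(status == 0))

# Cox model using the Efron approximation for ties (NOT method = "exact").
cox_fit <- coxph(Surv(time, event) ~ litter, data = d, method = "efron")

# Poisson model with log follow-up time as the offset.
pois_fit <- glm(event ~ litter + offset(log(time)),
                data = d, family = poisson)

# Put the two litter coefficients side by side for comparison.
estimates <- c(cox_efron = unname(coef(cox_fit)["litter"]),
               poisson   = unname(coef(pois_fit)["litter"]))
print(estimates)
```

The Poisson model assumes a constant baseline hazard while the Cox model leaves it unspecified, so the two numbers need not agree closely; how close they are is worth checking on your own data.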

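If you want to see documentation of the "offset trick", go to Breslow and Day's classic monograph: "Statistical Methods in Cancer Research; Vol II- The Design and Analysis of Cohort Studies". They used the GLIM software package, but the code is very similar to R's glm implementation, so transferring the concepts should be straightforward. (I had the opportunity to work briefly with Norm Breslow, using GLIM, on my Masters thesis. He was brilliant. I think it was my prior training in GLIM that made picking up R so easy.)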
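To see what the offset trick is doing, note that a Poisson GLM on a 0/1 event indicator with offset log(time) has the same likelihood, up to a term that does not involve the coefficients, as a constant-hazard (exponential) survival model. A minimal sketch of that equivalence, assuming the unreplicated rats data and the status == 0 event coding used above (the object names are illustrative):

```r
library(survival)

# Assumption: same event coding as the glm call in this answer (status == 0).
d <- transform(rats, event = as.integer(status == 0))

# Poisson GLM with log(time) as the offset: models the log event rate.
pois_fit <- glm(event ~ litter + offset(log(time)),
                data = d, family = poisson)

# Exponential survival model fitted directly with survreg().
exp_fit <- survreg(Surv(time, event) ~ litter, data = d,
                   dist = "exponential")

# survreg() models log(time), so its coefficients carry the opposite sign
# of the log-hazard coefficients reported by the Poisson GLM.
cmp <- cbind(poisson = coef(pois_fit), neg_survreg = -coef(exp_fit))
print(cmp)
```

The two columns should match to within numerical tolerance, which is the sense in which the Poisson fit is a survival model, just one with a constant baseline hazard.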
