Operations on a calculated .BY in data.table
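As an extension of this question, I want to run a calculation that involves a .BY variable that is itself the product of a calculation. The questions I have reviewed use keys that only access existing values; they do not transform or aggregate them.
In this example, I am trying to produce an ROC for a binary classifier with a function that takes advantage of data.table (because the ROC calculations in existing packages are very slow). Here the .BY variable is the cutpoint, and the calculation is the true positive rate and false positive rate of the probability estimates at that cutpoint.
I can do this with an intermediate data.table, but I am looking for a more efficient solution. This works: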
# dummy example
library(data.table)
dt <- setDT(get(data(GermanCredit, package='caret'))
            )[, `:=`(y = as.integer(Class=='Bad'),
                     Class = NULL)]
model <- glm(y ~ ., family='binomial', data=dt)
dt[,y_est := predict(model, type='response')]
#--- Generate ROC with specified # of cutpoints ---
# level of resolution of ROC curve -- up to uniqueN(y_est)
res <- 5
# vector of cutpoints (thresholds for y_est)
cuts <- dt[,.( thresh=quantile(y_est, probs=0:res/res) )]
# at y_est >= each threshold, how many true positive and false positives?
roc <- cuts[, .( tpr = dt[y_est>=.BY[[1]],sum(y==1)]/dt[,sum(y==1)],
                 fpr = dt[y_est>=.BY[[1]],sum(y==0)]/dt[,sum(y==0)]
               ), by=thresh]
plot(tpr~fpr,data=roc,type='s') # looks right
But this does not:
# this doesn't work, and doesn't have access to the total positives & negatives
dt[, .(tp=sum( (y_est>=.BY[[1]]) & (y==1) ),
       fp=sum( (y_est>=.BY[[1]]) & (y==0) ) ),
   keyby=.(thresh= quantile(y_est, probs=0:res/res) )]
# Error in `[.data.table`(dt, , .(tp = sum((y_est >= .BY[[1]]) & (y == 1)), :
# The items in the 'by' or 'keyby' list are length (6).
# Each must be same length as rows in x or number of rows returned by i (1000).
Is there an idiomatic data.table (or at least more efficient) way to do this?
You can use a non-equi join:
dt[.(thresh = quantile(y_est, probs=0:res/res)), on = .(y_est >= thresh),
   .(fp = sum(y == 0), tp = sum(y == 1)), by = .EACHI][,
     lapply(.SD, function(x) x/x[1]), .SDcols = -"y_est"]
# fp tp
#1: 1.00000000 1.000000000
#2: 0.72714286 0.970000000
#3: 0.46857143 0.906666667
#4: 0.24142857 0.770000000
#5: 0.08142857 0.476666667
#6: 0.00000000 0.003333333
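To see what the join is doing, here is a minimal sketch on made-up data. The table `toy`, the table `cuts`, and the threshold values are hypothetical and are not from the question. For each threshold row, `by = .EACHI` evaluates `j` over the rows of `toy` that satisfy `y_est >= thresh`. The thresholds are chosen so that every one matches at least one row and the lowest one matches all rows, which is also the case for the quantile-based thresholds above.

```r
library(data.table)

# Hypothetical scores and labels, for illustration only
toy <- data.table(y_est = c(0.1, 0.4, 0.35, 0.8, 0.7, 0.2),
                  y     = c(0L,  0L,  1L,   1L,  1L,  0L))

# Hypothetical thresholds; the first is the minimum score, the last the maximum
cuts <- data.table(thresh = c(0.1, 0.3, 0.5, 0.8))

# For each row of cuts, aggregate the rows of toy with y_est >= thresh
counts <- toy[cuts, on = .(y_est >= thresh),
              .(tp = sum(y == 1), fp = sum(y == 0)), by = .EACHI]

# After the join, the column named y_est holds the threshold values
setnames(counts, "y_est", "thresh")

# The first row covers every observation, so it holds the totals;
# dividing by it turns counts into rates
counts[, `:=`(tpr = tp / tp[1], fpr = fp / fp[1])]
print(counts)
```

Dividing by the first row only works because the lowest threshold is no greater than the smallest score. If the thresholds came from somewhere else, the totals would have to be computed separately, for example with `toy[, sum(y == 1)]`.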