data.table not returning the correct splinefun by group
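We recently updated data.table from version 1.12.0 to 1.12.8, and R from 3.5.3 to 3.6.3. The example was run on Windows.
We have a data.table in which we loop over a category column and create a splinefun object for later use. We store this splinefun output in a list column of the data.table. On our old setup this worked as expected, producing a separate splinefun for each category level, fitted to that category's segment of the data. Now, however, it appears to keep only the values from the final category and apply them to every entry.
Setting up data
Create some fake data to show the issue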
# R version: 3.6.3 (2020-02-29)
library(data.table) # data.table_1.12.8
library(ggplot2)
library(stats)
# mimic our data in simpler format
set.seed(1)
dt <- data.table(cat = rep(letters[1:3], each = 10),
x = 1:10)
dt[, y := x^0.5 * rnorm(.N, mean=runif(1, 1, 100), sd=runif(1, 1, 10)), by=cat]
# can see that each line is different
pl0 <- ggplot(data=dt, aes(x=x, y=y, col=cat)) + geom_line()
pl0
Fit splines
Fit the splines with our current method and with lapply for comparison. lapply works as expected; data.table does not.
# fit spline, segment the data by category
mod_splines <- dt[, .(Spline = list(splinefun(x=x, y=y, method = "natural"))),
by = c("cat")]
# splinefun works such that you provide new values of x and it gives an output
# y from a spline fitted to y~x
# Can see they are all the same, which seems unlikely
mod_splines$Spline[[1]](5)
mod_splines$Spline[[2]](5)
mod_splines$Spline[[3]](5)
# alternative approach
alt_splines <- lapply(unique(dt$cat), function(x_cat){
splinefun(x=dt[cat==x_cat, ]$x,
y=dt[cat==x_cat, ]$y,
method = "natural")
})
# looks more realistic
alt_splines[[1]](5)
alt_splines[[2]](5)
alt_splines[[3]](5) # Matches the mod_splines one!
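The comparison above can also be done in one table rather than by reading six printed values. The following is only a sketch: the names cmp, grouped_j and lapply_fit are made up for illustration, and whether the two columns disagree depends on the installed R and data.table versions.

```r
library(data.table)

# Recreate the example data (x is written out with rep() instead of being recycled)
set.seed(1)
dt <- data.table(cat = rep(letters[1:3], each = 10), x = rep(1:10, times = 3))
dt[, y := x^0.5 * rnorm(.N, mean = runif(1, 1, 100), sd = runif(1, 1, 10)), by = cat]

# Grouped-j approach from the question
mod_splines <- dt[, .(Spline = list(splinefun(x = x, y = y, method = "natural"))), by = cat]

# lapply approach from the question
alt_splines <- lapply(unique(dt$cat), function(x_cat) {
  splinefun(x = dt[cat == x_cat]$x, y = dt[cat == x_cat]$y, method = "natural")
})

# Evaluate every spline at x = 5 and put the results side by side
cmp <- data.frame(
  cat        = unique(dt$cat),
  grouped_j  = vapply(mod_splines$Spline, function(f) f(5), numeric(1)),
  lapply_fit = vapply(alt_splines, function(f) f(5), numeric(1))
)
cmp$agree <- abs(cmp$grouped_j - cmp$lapply_fit) < 1e-8
print(cmp)
```

Rows where agree is FALSE are the groups whose stored spline does not match an independent fit.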
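Checking that splinefun is fitting OK
When we print from inside the data.table loop, the data and the splinefun output look correct, but the function is not being stored correctly.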
# check the data is segmenting
mod_splines2 <- dt[, .(Spline = list(splinefun(x=x, y=y, method = "natural")),
x=x, y=y),
by = c("cat")]
mod_splines2[] # the data is definitely segmenting ok
# try catching and printing the data
splinefun_withmorefun <- function(x, y){
writeLines(paste(x, collapse =", "))
writeLines(paste(round(y, 0), collapse =", "))
foo <- splinefun(x=x,
y=y,
method = "natural")
writeLines(paste(foo(5), collapse =", "))
writeLines("")
return(foo)
}
# looks OK inside the function, as it prints different results for each group
mod_splines3 <- dt[, .(Spline = list(splinefun_withmorefun(x=x, y=y))),
by = c("cat")]
# but not coming through in to the listed function
mod_splines3$Spline[[1]](5)
mod_splines3$Spline[[2]](5)
mod_splines3$Spline[[3]](5)
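One further check is to look at what a stored spline function actually holds. The sketch below uses base R only and relies on an implementation detail that is an assumption here, not documented API: that for method = "natural" the function returned by splinefun keeps its fitted data in a list named z in its enclosing environment. This may differ between R versions.

```r
# Base R only. ASSUMPTION: the closure returned by stats::splinefun() keeps the
# knots and coefficients in an object called `z` in its environment
# (an internal detail, not a documented interface).
f <- splinefun(x = 1:10, y = (1:10)^2, method = "natural")

env <- environment(f)
ls(env)                        # objects captured by the closure

stored <- env$z[c("x", "y")]   # the x and y values the spline will evaluate against
str(stored)
```

Applying the same inspection to each element of mod_splines3$Spline would show whether the three stored functions hold different y values or the same ones.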
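It would be great to know why this became an issue after the update! We are worried that there may be other places where we use a similar data.table approach that could now be silently broken in the same way.
Thanks,
Johnny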
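As I answered at https://github.com/Rdatatable/data.table/issues/4298#issuecomment-597737776, adding copy() to the x and y variables will solve this problem.
The reason is that splinefun() tries to store the values of x and y. However, data.table's internal objects are always passed by reference (for speed), so what gets stored can end up pointing at memory that data.table goes on to reuse for the next group. That is consistent with every stored function returning the final category's result. In this case you may have to copy() the variables explicitly to get the expected answer.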
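For readers less familiar with reference semantics in data.table, here is a minimal sketch of the general idea. It is a simpler, separate illustration using := on whole tables, not the exact mechanism inside a grouped j; the table names a, b, c0 and d are made up for the example.

```r
library(data.table)

a <- data.table(id = 1:3)
b <- a                # no copy is made: a and b refer to the same table
b[, flag := TRUE]     # := modifies that shared table in place
names(a)              # a has gained the flag column as well

c0 <- data.table(id = 1:3)
d <- copy(c0)         # copy() makes d independent of c0
d[, flag := TRUE]
names(c0)             # c0 is unchanged
```

The same principle applies to the answer: copy(x) and copy(y) hand splinefun() its own data instead of a reference to something data.table may later change.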
In summary, change
mod_splines <- dt[, .(Spline = list(splinefun(x=x, y=y, method = "natural"))),
by = c("cat")]
to
mod_splines <- dt[, .(Spline = list(splinefun(x=copy(x), y=copy(y), method = "natural"))),
by = c("cat")]
or to this (you can ignore this one, but it may give you a better understanding of what is going on)
mod_splines <- dt[, .(Spline = list(splinefun(x=x+0, y=y+0, method = "natural"))),
by = cat]
Either change is enough.
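To confirm the fix on your own setup, one option is to compare the copy() version against splines fitted independently per category, as in the question's lapply approach. This is a sketch under the question's example data; fixed_splines, ref_splines, fixed_vals and ref_vals are illustrative names.

```r
library(data.table)

# Recreate the example data (x is written out with rep() instead of being recycled)
set.seed(1)
dt <- data.table(cat = rep(letters[1:3], each = 10), x = rep(1:10, times = 3))
dt[, y := x^0.5 * rnorm(.N, mean = runif(1, 1, 100), sd = runif(1, 1, 10)), by = cat]

# Grouped fit with explicit copies, so each stored function owns its data
fixed_splines <- dt[, .(Spline = list(splinefun(x = copy(x), y = copy(y), method = "natural"))),
                    by = cat]

# Independent fit per category, without grouped j
ref_splines <- lapply(unique(dt$cat), function(x_cat) {
  splinefun(x = dt[cat == x_cat]$x, y = dt[cat == x_cat]$y, method = "natural")
})

# Evaluate both sets at x = 5 and check that they agree
fixed_vals <- vapply(fixed_splines$Spline, function(f) f(5), numeric(1))
ref_vals   <- vapply(ref_splines, function(f) f(5), numeric(1))
all.equal(fixed_vals, ref_vals)
```

A check like this can also be reused on other grouped calls that store functions or other objects holding on to their inputs.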