data.table 中的滚动加入使用了错误的时区
Rolling join in data.table is using wrong time zone
我正在尝试使用滚动连接来连接两个数据 table。我的 time/date 邮票是指定为东部标准时区 (EST) 的 POSIXct。在连接期间,取而代之的是检索使用 GMT 的值,即来自 EST - 5 小时的值。此外,它正在更正夏令时,从 EST 检索值 - 夏令时 4 小时。
我试过使用 fuzzyjoin
并且它工作正常,但即使对于小连接也非常慢,并且 运行s 会因为更大的问题而耗尽资源(我猜是因为它正在做一个完整的笛卡尔连接然后过滤)所以不会 运行 我真正的工作。我查看了 stackoverlow,但没有看到描述的这个问题
在下面的示例中,流量数据是每隔 15 分钟测量一次的流量(以立方米每秒为单位),化学数据是分析抓样的结果。我想要的是找到在时间上最接近抓取样本收集时间的流量值 (Q.cms)。结果,从流数据中检索到的 Q.cms 应该匹配 flow.EST 而不是 flow.GMT (flow.EST 和 flow.GMT 是使用 [=29 中的 vlookup 函数获得的=]).您可以查看结果 table 或使用 ggplot 可视化结果。黑色符号应该落在 1:1 线上,而不是红色符号。其中一些遵循此处给出的示例 https://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html。在 roll 语句中,我还尝试了“-Inf”和 "T" - 都没有帮助。
library(tidyverse)
library(data.table)
NHchem <- tibble(SampleDateTime = list("2012-03-07 15:15:00",
"2012-03-07 07:57:00",
"2012-03-07 09:00:00",
"2012-03-07 08:31:00",
"2012-03-07 08:00:00"),
Parameter = rep("HgD", 5),
Value = c(11.83, 9.09, 8.98, 10.91, 10.91),
flow.EST = c(1.734292, 1.730299, 1.711665, 1.721448, 1.729299),
flow.GMT = c(1.729082, 1.855115, 1.818062, 1.838623, 1.855263)
)
NHchem$SampleDateTime <- lubridate::ymd_hms(NHchem$SampleDateTime, tz = "EST")
NHflow <- tibble(DateTime = list("2012-03-07 00:00:00", "2012-03-07 00:15:00", "2012-03-07 00:30:00",
"2012-03-07 00:45:00", "2012-03-07 01:00:00", "2012-03-07 01:15:00",
"2012-03-07 01:30:00", "2012-03-07 01:45:00", "2012-03-07 02:00:00",
"2012-03-07 02:15:00", "2012-03-07 02:30:00", "2012-03-07 02:45:00",
"2012-03-07 03:00:00", "2012-03-07 03:15:00", "2012-03-07 03:30:00",
"2012-03-07 03:45:00", "2012-03-07 04:00:00", "2012-03-07 04:15:00",
"2012-03-07 04:30:00", "2012-03-07 04:45:00", "2012-03-07 05:00:00",
"2012-03-07 05:15:00", "2012-03-07 05:30:00", "2012-03-07 05:45:00",
"2012-03-07 06:00:00", "2012-03-07 06:15:00", "2012-03-07 06:30:00",
"2012-03-07 06:45:00", "2012-03-07 07:00:00", "2012-03-07 07:15:00",
"2012-03-07 07:30:00", "2012-03-07 07:45:00", "2012-03-07 08:00:00",
"2012-03-07 08:15:00", "2012-03-07 08:30:00", "2012-03-07 08:45:00",
"2012-03-07 09:00:00", "2012-03-07 09:15:00", "2012-03-07 09:30:00",
"2012-03-07 09:45:00", "2012-03-07 10:00:00", "2012-03-07 10:15:00",
"2012-03-07 10:30:00", "2012-03-07 10:45:00", "2012-03-07 11:00:00",
"2012-03-07 11:15:00", "2012-03-07 11:30:00", "2012-03-07 11:45:00",
"2012-03-07 12:00:00", "2012-03-07 12:15:00", "2012-03-07 12:30:00",
"2012-03-07 12:45:00", "2012-03-07 13:00:00", "2012-03-07 13:15:00",
"2012-03-07 13:30:00", "2012-03-07 13:45:00", "2012-03-07 14:00:00",
"2012-03-07 14:15:00", "2012-03-07 14:30:00", "2012-03-07 14:45:00",
"2012-03-07 15:00:00", "2012-03-07 15:15:00", "2012-03-07 15:30:00",
"2012-03-07 15:45:00", "2012-03-07 16:00:00", "2012-03-07 16:15:00",
"2012-03-07 16:30:00", "2012-03-07 16:45:00", "2012-03-07 17:00:00",
"2012-03-07 17:15:00", "2012-03-07 17:30:00", "2012-03-07 17:45:00",
"2012-03-07 18:00:00", "2012-03-07 18:15:00", "2012-03-07 18:30:00",
"2012-03-07 18:45:00", "2012-03-07 19:00:00", "2012-03-07 19:15:00",
"2012-03-07 19:30:00", "2012-03-07 19:45:00", "2012-03-07 20:00:00",
"2012-03-07 20:15:00", "2012-03-07 20:30:00", "2012-03-07 20:45:00",
"2012-03-07 21:00:00", "2012-03-07 21:15:00", "2012-03-07 21:30:00",
"2012-03-07 21:45:00", "2012-03-07 22:00:00", "2012-03-07 22:15:00",
"2012-03-07 22:30:00", "2012-03-07 22:45:00", "2012-03-07 23:00:00",
"2012-03-07 23:15:00", "2012-03-07 23:30:00", "2012-03-07 23:45:00"
),
Q.cms = c(1.9183428976162, 1.90884554590853, 1.90659614969849, 1.89946370922413,
1.89667381849492, 1.88823268972675, 1.89009845845496, 1.88925133710857,
1.89175644143891, 1.88902902166044, 1.88199493586431, 1.87318875971247,
1.86917038183388, 1.86071830324102, 1.85932613686066, 1.86093401873381,
1.87101369024615, 1.87296981545807, 1.8762632040545, 1.86625922525504,
1.86445655629997, 1.86091440792692, 1.86261797589624, 1.85849714690774,
1.85809949483732, 1.85365423835426, 1.86769219811402, 1.87137568188775,
1.86957715377645, 1.86178854932717, 1.85842153505713, 1.85511511281169,
1.85526345829864, 1.84606992374992, 1.8386232796469, 1.8306698161104,
1.81806190831676, 1.80972662643993, 1.80671560966769, 1.79692071214778,
1.79246415779217, 1.78383758090983, 1.77911151556738, 1.77540849330392,
1.76977546791765, 1.76704647349273, 1.75506646810193, 1.75116656853259,
1.74116032441497, 1.73694852237565, 1.73539800459245, 1.73029929958069,
1.72929873358336, 1.72145096214292, 1.72144822481096, 1.7137286373663,
1.71166508485562, 1.72512015058183, 1.72801058727519, 1.71946126373971,
1.72908219540057, 1.73003063754833, 1.73433620637326, 1.74704929729279,
1.75374511203315, 1.75751451176466, 1.74550984385788, 1.7374425791972,
1.73028285052801, 1.73164553811319, 1.71538821542136, 1.71111583998635,
1.69850042893639, 1.7018112620745, 1.70068743638687, 1.70624871061042,
1.70513766128531, 1.70189855875322, 1.71388172504149, 1.70687667156756,
1.72965507980508, 1.73429231190258, 1.73062829040751, 1.74161628160761,
1.74502884750137, 1.74764319440737, 1.76775550519358, 1.74926846992512,
1.75406991794209, 1.75302950118107, 1.75219840078475, 1.75821138471585,
1.74749746409036, 1.75047602988317, 1.76192575356191, 1.76360735465132)
)
NHflow$DateTime <- lubridate::ymd_hms(NHflow$DateTime, tz = "EST")
NHflowDT <- setDT(NHflow)
NHchemDT <- setDT(NHchem)
NHflowDT[, join_time := DateTime]
NHchemDT[, join_time := SampleDateTime]
setkey(NHflowDT, join_time)
setkey(NHchemDT, join_time)
NHflow_chem <- NHflowDT[ NHchemDT, roll = 'nearest']
NHflow_chem
ggplot(data = NHflow_chem, mapping = aes(x = flow.EST, y = Q.cms)) +
geom_point(size = 4) +
geom_point(aes(x = flow.GMT, y = Q.cms), color = "red", size = 4) +
geom_abline(slope = 1, intercept = 0) +
labs(x = "flow.EST (black), flow.GMT (red)")
我不确定你的确切问题是什么,但当我查看你的示例数据集时,下面的滚动更新连接似乎工作正常...
NHflowDT <- setDT(NHflow)
NHchemDT <- setDT(NHchem)
NHchemDT[, c("nearest_flow_time", "Q.cms") :=
NHflowDT[NHchemDT, .(x.DateTime, x.Q.cms),
on = .(DateTime = SampleDateTime),
roll = "nearest"] ][]
输出
SampleDateTime Parameter Value flow.EST flow.GMT nearest_flow_time Q.cms
1: 2012-03-07 15:15:00 HgD 11.83 1.734292 1.729082 2012-03-07 15:15:00 1.730031
2: 2012-03-07 07:57:00 HgD 9.09 1.730299 1.855115 2012-03-07 08:00:00 1.855263
3: 2012-03-07 09:00:00 HgD 8.98 1.711665 1.818062 2012-03-07 09:00:00 1.818062
4: 2012-03-07 08:31:00 HgD 10.91 1.721448 1.838623 2012-03-07 08:30:00 1.838623
5: 2012-03-07 08:00:00 HgD 10.91 1.729299 1.855263 2012-03-07 08:00:00 1.855263
> attributes( NHchemDT$SampleDateTime)
$class
[1] "POSIXct" "POSIXt"
$tzone
[1] "EST"
> attributes( NHchemDT$nearest_flow_time)
$class
[1] "POSIXct" "POSIXt"
$tzone
[1] "EST"
我正在尝试使用滚动连接来连接两个数据 table。我的 time/date 邮票是指定为东部标准时区 (EST) 的 POSIXct。在连接期间,取而代之的是检索使用 GMT 的值,即来自 EST - 5 小时的值。此外,它正在更正夏令时,从 EST 检索值 - 夏令时 4 小时。
我试过使用 fuzzyjoin
并且它工作正常,但即使对于小连接也非常慢,并且 运行s 会因为更大的问题而耗尽资源(我猜是因为它正在做一个完整的笛卡尔连接然后过滤)所以不会 运行 我真正的工作。我查看了 stackoverlow,但没有看到描述的这个问题
在下面的示例中,流量数据是每隔 15 分钟测量一次的流量(以立方米每秒为单位),化学数据是分析抓样的结果。我想要的是找到在时间上最接近抓取样本收集时间的流量值 (Q.cms)。结果,从流数据中检索到的 Q.cms 应该匹配 flow.EST 而不是 flow.GMT (flow.EST 和 flow.GMT 是使用 [=29 中的 vlookup 函数获得的=]).您可以查看结果 table 或使用 ggplot 可视化结果。黑色符号应该落在 1:1 线上,而不是红色符号。其中一些遵循此处给出的示例 https://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html。在 roll 语句中,我还尝试了“-Inf”和 "T" - 都没有帮助。
library(tidyverse)
library(data.table)
NHchem <- tibble(SampleDateTime = list("2012-03-07 15:15:00",
"2012-03-07 07:57:00",
"2012-03-07 09:00:00",
"2012-03-07 08:31:00",
"2012-03-07 08:00:00"),
Parameter = rep("HgD", 5),
Value = c(11.83, 9.09, 8.98, 10.91, 10.91),
flow.EST = c(1.734292, 1.730299, 1.711665, 1.721448, 1.729299),
flow.GMT = c(1.729082, 1.855115, 1.818062, 1.838623, 1.855263)
)
NHchem$SampleDateTime <- lubridate::ymd_hms(NHchem$SampleDateTime, tz = "EST")
NHflow <- tibble(DateTime = list("2012-03-07 00:00:00", "2012-03-07 00:15:00", "2012-03-07 00:30:00",
"2012-03-07 00:45:00", "2012-03-07 01:00:00", "2012-03-07 01:15:00",
"2012-03-07 01:30:00", "2012-03-07 01:45:00", "2012-03-07 02:00:00",
"2012-03-07 02:15:00", "2012-03-07 02:30:00", "2012-03-07 02:45:00",
"2012-03-07 03:00:00", "2012-03-07 03:15:00", "2012-03-07 03:30:00",
"2012-03-07 03:45:00", "2012-03-07 04:00:00", "2012-03-07 04:15:00",
"2012-03-07 04:30:00", "2012-03-07 04:45:00", "2012-03-07 05:00:00",
"2012-03-07 05:15:00", "2012-03-07 05:30:00", "2012-03-07 05:45:00",
"2012-03-07 06:00:00", "2012-03-07 06:15:00", "2012-03-07 06:30:00",
"2012-03-07 06:45:00", "2012-03-07 07:00:00", "2012-03-07 07:15:00",
"2012-03-07 07:30:00", "2012-03-07 07:45:00", "2012-03-07 08:00:00",
"2012-03-07 08:15:00", "2012-03-07 08:30:00", "2012-03-07 08:45:00",
"2012-03-07 09:00:00", "2012-03-07 09:15:00", "2012-03-07 09:30:00",
"2012-03-07 09:45:00", "2012-03-07 10:00:00", "2012-03-07 10:15:00",
"2012-03-07 10:30:00", "2012-03-07 10:45:00", "2012-03-07 11:00:00",
"2012-03-07 11:15:00", "2012-03-07 11:30:00", "2012-03-07 11:45:00",
"2012-03-07 12:00:00", "2012-03-07 12:15:00", "2012-03-07 12:30:00",
"2012-03-07 12:45:00", "2012-03-07 13:00:00", "2012-03-07 13:15:00",
"2012-03-07 13:30:00", "2012-03-07 13:45:00", "2012-03-07 14:00:00",
"2012-03-07 14:15:00", "2012-03-07 14:30:00", "2012-03-07 14:45:00",
"2012-03-07 15:00:00", "2012-03-07 15:15:00", "2012-03-07 15:30:00",
"2012-03-07 15:45:00", "2012-03-07 16:00:00", "2012-03-07 16:15:00",
"2012-03-07 16:30:00", "2012-03-07 16:45:00", "2012-03-07 17:00:00",
"2012-03-07 17:15:00", "2012-03-07 17:30:00", "2012-03-07 17:45:00",
"2012-03-07 18:00:00", "2012-03-07 18:15:00", "2012-03-07 18:30:00",
"2012-03-07 18:45:00", "2012-03-07 19:00:00", "2012-03-07 19:15:00",
"2012-03-07 19:30:00", "2012-03-07 19:45:00", "2012-03-07 20:00:00",
"2012-03-07 20:15:00", "2012-03-07 20:30:00", "2012-03-07 20:45:00",
"2012-03-07 21:00:00", "2012-03-07 21:15:00", "2012-03-07 21:30:00",
"2012-03-07 21:45:00", "2012-03-07 22:00:00", "2012-03-07 22:15:00",
"2012-03-07 22:30:00", "2012-03-07 22:45:00", "2012-03-07 23:00:00",
"2012-03-07 23:15:00", "2012-03-07 23:30:00", "2012-03-07 23:45:00"
),
Q.cms = c(1.9183428976162, 1.90884554590853, 1.90659614969849, 1.89946370922413,
1.89667381849492, 1.88823268972675, 1.89009845845496, 1.88925133710857,
1.89175644143891, 1.88902902166044, 1.88199493586431, 1.87318875971247,
1.86917038183388, 1.86071830324102, 1.85932613686066, 1.86093401873381,
1.87101369024615, 1.87296981545807, 1.8762632040545, 1.86625922525504,
1.86445655629997, 1.86091440792692, 1.86261797589624, 1.85849714690774,
1.85809949483732, 1.85365423835426, 1.86769219811402, 1.87137568188775,
1.86957715377645, 1.86178854932717, 1.85842153505713, 1.85511511281169,
1.85526345829864, 1.84606992374992, 1.8386232796469, 1.8306698161104,
1.81806190831676, 1.80972662643993, 1.80671560966769, 1.79692071214778,
1.79246415779217, 1.78383758090983, 1.77911151556738, 1.77540849330392,
1.76977546791765, 1.76704647349273, 1.75506646810193, 1.75116656853259,
1.74116032441497, 1.73694852237565, 1.73539800459245, 1.73029929958069,
1.72929873358336, 1.72145096214292, 1.72144822481096, 1.7137286373663,
1.71166508485562, 1.72512015058183, 1.72801058727519, 1.71946126373971,
1.72908219540057, 1.73003063754833, 1.73433620637326, 1.74704929729279,
1.75374511203315, 1.75751451176466, 1.74550984385788, 1.7374425791972,
1.73028285052801, 1.73164553811319, 1.71538821542136, 1.71111583998635,
1.69850042893639, 1.7018112620745, 1.70068743638687, 1.70624871061042,
1.70513766128531, 1.70189855875322, 1.71388172504149, 1.70687667156756,
1.72965507980508, 1.73429231190258, 1.73062829040751, 1.74161628160761,
1.74502884750137, 1.74764319440737, 1.76775550519358, 1.74926846992512,
1.75406991794209, 1.75302950118107, 1.75219840078475, 1.75821138471585,
1.74749746409036, 1.75047602988317, 1.76192575356191, 1.76360735465132)
)
NHflow$DateTime <- lubridate::ymd_hms(NHflow$DateTime, tz = "EST")
NHflowDT <- setDT(NHflow)
NHchemDT <- setDT(NHchem)
NHflowDT[, join_time := DateTime]
NHchemDT[, join_time := SampleDateTime]
setkey(NHflowDT, join_time)
setkey(NHchemDT, join_time)
NHflow_chem <- NHflowDT[ NHchemDT, roll = 'nearest']
NHflow_chem
ggplot(data = NHflow_chem, mapping = aes(x = flow.EST, y = Q.cms)) +
geom_point(size = 4) +
geom_point(aes(x = flow.GMT, y = Q.cms), color = "red", size = 4) +
geom_abline(slope = 1, intercept = 0) +
labs(x = "flow.EST (black), flow.GMT (red)")
我不确定你的确切问题是什么,但当我查看你的示例数据集时,下面的滚动更新连接似乎工作正常...
NHflowDT <- setDT(NHflow)
NHchemDT <- setDT(NHchem)
NHchemDT[, c("nearest_flow_time", "Q.cms") :=
NHflowDT[NHchemDT, .(x.DateTime, x.Q.cms),
on = .(DateTime = SampleDateTime),
roll = "nearest"] ][]
输出
SampleDateTime Parameter Value flow.EST flow.GMT nearest_flow_time Q.cms
1: 2012-03-07 15:15:00 HgD 11.83 1.734292 1.729082 2012-03-07 15:15:00 1.730031
2: 2012-03-07 07:57:00 HgD 9.09 1.730299 1.855115 2012-03-07 08:00:00 1.855263
3: 2012-03-07 09:00:00 HgD 8.98 1.711665 1.818062 2012-03-07 09:00:00 1.818062
4: 2012-03-07 08:31:00 HgD 10.91 1.721448 1.838623 2012-03-07 08:30:00 1.838623
5: 2012-03-07 08:00:00 HgD 10.91 1.729299 1.855263 2012-03-07 08:00:00 1.855263
> attributes( NHchemDT$SampleDateTime)
$class
[1] "POSIXct" "POSIXt"
$tzone
[1] "EST"
> attributes( NHchemDT$nearest_flow_time)
$class
[1] "POSIXct" "POSIXt"
$tzone
[1] "EST"