How to find the best matching dates on two dataframes in R?
I have two data frames, insitu and model:
dput(head(insitu,20))
structure(list(ID = c("AUR", "AUR", "AUR", "AUR", "AUR", "AUR",
"LAM", "LAM", "LAM", "LAM", "LAM", "LAM"), D_SOS = structure(c(16929,
17149, 17422, 17850, 18389, 18202, 17044, 16744, 17300, 17522,
18027, 18198), class = "Date"), D_EOS = structure(c(17067, 17353,
17712, 18082, 18516, 18360, 17123, 17002, 17414, 17722, 18148,
18446), class = "Date")), row.names = c(NA, -12L), class = c("tbl_df",
"tbl", "data.frame"))
dput(head(model,20))
structure(list(ID = c("AUR", "AUR", "AUR", "AUR", "AUR", "AUR",
"AUR", "AUR", "LAM", "LAM", "LAM", "LAM", "LAM", "LAM", "LAM"
), EVI_SOS = structure(c(16934, 17137, 17378, 17605, 17862, 18003,
18192, 18395, 16744, 17134, 17278, 17518, 17725, 18004, 18200
), class = "Date"), EVI_EOS = structure(c(17074, 17361, 17591,
17798, 17994, 18096, 18376, 18594, 17106, 17252, 17431, 17705,
17862, 18173, 18549), class = "Date")), row.names = c(NA, -15L
), class = c("tbl_df", "tbl", "data.frame"))
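What I want to do:
I want to find the best-matching dates between the two data frames, on selected columns and within corresponding rows. In other words, for the rows of insitu whose ID is AUR, which dates in the EVI_SOS column of model (again restricted to rows whose ID is AUR) best match the dates in D_SOS? The same has to be done for the LAM rows.
An example of the desired output is: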
dput(head(output,20))
structure(list(ID = c("AUR", "AUR", "AUR", "AUR", "AUR", "AUR",
"LAM", "LAM", "LAM", "LAM", "LAM", "LAM"), D_SOS = structure(c(16929,
17149, 17422, 17850, 18389, 18202, 17044, 16744, 17300, 17522,
18027, 18198), class = "Date"), EVI_SOS = structure(c(16934,
17137, 17378, 17862, 18395, 18192, 17134, 16744, 17278, 17518,
18004, 18200), class = "Date"), D_EOS = structure(c(17067, 17353,
17712, 18082, 18516, 18360, 17123, 17002, 17414, 17722, 18148,
18446), class = "Date"), EVI_EOS = structure(c(17074, 17361,
17798, 18096, 18594, 18376, 17252, 17106, 17431, 17705, 18173,
18549), class = "Date")), row.names = c(NA, -12L), class = c("tbl_df",
"tbl", "data.frame"))
It looks like this:
ID D_SOS EVI_SOS D_EOS EVI_EOS
1 AUR 2016-05-08 2016-05-13 2016-09-23 2016-09-30
2 AUR 2016-12-14 2016-12-02 2017-07-06 2017-07-14
3 AUR 2017-09-13 2017-07-31 2018-06-30 2018-09-24
4 AUR 2018-11-15 2018-11-27 2019-07-05 2019-07-19
5 AUR 2020-05-07 2020-05-13 2020-09-11 2020-11-28
6 AUR 2019-11-02 2019-10-23 2020-04-08 2020-04-24
7 LAM 2016-08-31 2016-11-29 2016-11-18 2017-03-27
8 LAM 2015-11-05 2015-11-05 2016-07-20 2016-11-01
9 LAM 2017-05-14 2017-04-22 2017-09-05 2017-09-22
10 LAM 2017-12-22 2017-12-18 2018-07-10 2018-06-23
11 LAM 2019-05-11 2019-04-18 2019-09-09 2019-10-04
12 LAM 2019-10-29 2019-10-31 2020-07-03 2020-10-14
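Basically, 6 of the 8 AUR dates in the model data frame will be matched, because insitu has only 6 dates for AUR. For LAM, the model data frame has 7 dates but insitu has 6, so 6 is the number to match.
The output should contain the insitu column first, e.g. D_SOS, followed by the model column, e.g. EVI_SOS, with the matching dates.
Any help will be much appreciated.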
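library(data.table)
# Convert both data frames to data.tables by reference
setDT(insitu)
setDT(model)
# Copy the date to join on into a column with the same name in both tables
insitu[, key := D_SOS]
model[, key := EVI_SOS]
# Key both tables by ID first, then by the join date
setkey(insitu, ID, key)
setkey(model, ID, key)
# Rolling join: ID must match exactly; the date rolls to the nearest model date
model[insitu, roll = "nearest"][, .(ID, D_SOS, EVI_SOS, D_EOS, EVI_EOS)]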
# ID D_SOS EVI_SOS D_EOS EVI_EOS
# 1: AUR 2016-05-08 2016-05-13 2016-09-23 2016-09-30
# 2: AUR 2016-12-14 2016-12-02 2017-07-06 2017-07-14
# 3: AUR 2017-09-13 2017-07-31 2018-06-30 2018-03-01
# 4: AUR 2018-11-15 2018-11-27 2019-07-05 2019-04-08
# 5: AUR 2019-11-02 2019-10-23 2020-04-08 2020-04-24
# 6: AUR 2020-05-07 2020-05-13 2020-09-11 2020-11-28
# 7: LAM 2015-11-05 2015-11-05 2016-07-20 2016-11-01
# 8: LAM 2016-08-31 2016-11-29 2016-11-18 2017-03-27
# 9: LAM 2017-05-14 2017-04-22 2017-09-05 2017-09-22
# 10: LAM 2017-12-22 2017-12-18 2018-07-10 2018-06-23
# 11: LAM 2019-05-11 2019-04-18 2019-09-09 2019-10-04
# 12: LAM 2019-10-29 2019-10-31 2020-07-03 2020-10-14
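To see what the rolling join above is doing, here is a minimal base R sketch of the same idea: for each insitu row, look only at the model rows with the same ID and pick the one whose EVI_SOS is closest to D_SOS. The data are rebuilt from the dput output in the question; nearest_match is a hypothetical helper name, not part of any package, and ties (none occur in this data) would simply go to the first candidate that which.min finds.

```r
# Rebuild the question's data as plain data frames (values copied from the dput output)
insitu <- data.frame(
  ID    = rep(c("AUR", "LAM"), each = 6),
  D_SOS = as.Date(c(16929, 17149, 17422, 17850, 18389, 18202,
                    17044, 16744, 17300, 17522, 18027, 18198),
                  origin = "1970-01-01"),
  D_EOS = as.Date(c(17067, 17353, 17712, 18082, 18516, 18360,
                    17123, 17002, 17414, 17722, 18148, 18446),
                  origin = "1970-01-01"),
  stringsAsFactors = FALSE
)
model <- data.frame(
  ID      = rep(c("AUR", "LAM"), times = c(8, 7)),
  EVI_SOS = as.Date(c(16934, 17137, 17378, 17605, 17862, 18003, 18192, 18395,
                      16744, 17134, 17278, 17518, 17725, 18004, 18200),
                    origin = "1970-01-01"),
  EVI_EOS = as.Date(c(17074, 17361, 17591, 17798, 17994, 18096, 18376, 18594,
                      17106, 17252, 17431, 17705, 17862, 18173, 18549),
                    origin = "1970-01-01"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per-ID nearest match of D_SOS against EVI_SOS
nearest_match <- function(ins, mod) {
  rows <- lapply(seq_len(nrow(ins)), function(i) {
    cand <- mod[mod$ID == ins$ID[i], ]                 # same-ID candidates only
    gap  <- abs(as.numeric(cand$EVI_SOS - ins$D_SOS[i]))  # distance in days
    j    <- which.min(gap)                             # closest candidate
    data.frame(ID = ins$ID[i],
               D_SOS = ins$D_SOS[i], EVI_SOS = cand$EVI_SOS[j],
               D_EOS = ins$D_EOS[i], EVI_EOS = cand$EVI_EOS[j])
  })
  do.call(rbind, rows)
}

out <- nearest_match(insitu, model)
print(out)

# A nearest match can pick the same model row twice; flag that if it happens
dup <- duplicated(out[, c("ID", "EVI_SOS")])
any(dup)
```

Unlike the keyed join, this keeps the rows in the original insitu order. Note that EVI_EOS is taken from whichever model row won the EVI_SOS match, which is why the EVI_EOS values for the third and fourth AUR rows differ from the desired output in the question.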