How to detect the closest value below and above a given reference variable in a data frame in R?
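Consider the following random MWE.
For each row, I am trying to identify which variable holds the closest value to the left of (before) the constant reference_day and which variable holds the closest value to the right of (after) the constant reference_day.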
df1 <-
structure(
list(id = 1:5,
gender = c("female", "male", "male", "male", "male"),
reference_day = structure(c(18052, NA, 18052, 18052, 18052), class = "Date"),
var1 = structure(c(16505, 17144, 18139, NA, 16639), class = "Date"),
var2 = structure(c(NA, 18042, 16544, 16697, NA), class = "Date"),
var3 = structure(c(17845, 18070, 17152, 16571, NA), class = "Date")),
row.names = c(NA, -5L), class = "data.frame")
df1
id gender reference_day var1 var2 var3
1 1 female 2019-06-05 2015-03-11 <NA> 2018-11-10
2 2 male <NA> 2016-12-09 2019-05-26 2019-06-23
3 3 male 2019-06-05 2019-08-31 2015-04-19 2016-12-17
4 4 male 2019-06-05 <NA> 2015-09-19 2015-05-16
5 5 male 2019-06-05 2015-07-23 <NA> <NA>
The result I want looks like this:
id gender reference_day var1 var2 var3 closest_to_left closest_to_right
1 1 female 2019-06-05 2015-03-11 <NA> 2018-11-10 var3 <NA>
2 2 male <NA> 2016-12-09 2019-05-26 2019-06-23 <NA> <NA>
3 3 male 2019-06-05 2019-08-31 2015-04-19 2016-12-17 var3 var1
4 4 male 2019-06-05 <NA> 2015-09-19 2015-05-16 var2 <NA>
5 5 male 2019-06-05 2015-07-23 <NA> <NA> var1 <NA>
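After a lot of trial and error I did manage to find a solution using dplyr's case_when function, but it needs a lot of boilerplate code, which makes me think there must be a smarter solution.
I personally prefer dplyr, but any help is much appreciated.
Custom functions that do this -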
library(dplyr)
cols <- df1 %>% select(starts_with('var')) %>% names
closest_to_right <- function(x, y) {
tmp <- y - x
if(any(tmp > 0, na.rm = TRUE))
cols[tmp %in% min(tmp[tmp > 0], na.rm = TRUE)] else NA
}
closest_to_left <- function(x, y) {
tmp <- y - x
if(any(tmp < 0, na.rm = TRUE))
cols[tmp %in% max(tmp[tmp < 0], na.rm = TRUE)] else NA
}
df1 %>%
rowwise() %>%
mutate(closest_to_left = closest_to_left(reference_day, c_across(starts_with('var'))),
closest_to_right = closest_to_right(reference_day, c_across(starts_with('var')))) %>%
ungroup
# id gender reference_day var1 var2 var3 closest_to_left closest_to_right
# <int> <chr> <date> <date> <date> <date> <chr> <chr>
#1 1 female 2019-06-05 2015-03-11 NA 2018-11-10 var3 NA
#2 2 male NA 2016-12-09 2019-05-26 2019-06-23 NA NA
#3 3 male 2019-06-05 2019-08-31 2015-04-19 2016-12-17 var3 var1
#4 4 male 2019-06-05 NA 2015-09-19 2015-05-16 var2 NA
#5 5 male 2019-06-05 2015-07-23 NA NA var1 NA
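If rowwise() turns out to be slow on a larger data frame, one possible alternative is to compute all differences as a numeric matrix and pick the column per row. This is only a sketch under assumptions: the helper name pick_col and the vars vector are made up for illustration, the data repeats the dates from the question (gender omitted), and ties are resolved by taking the first matching column.

```r
# Same dates as in the question's df1 (gender column omitted for brevity)
df1 <- data.frame(
  id = 1:5,
  reference_day = as.Date(c(18052, NA, 18052, 18052, 18052), origin = "1970-01-01"),
  var1 = as.Date(c(16505, 17144, 18139, NA, 16639), origin = "1970-01-01"),
  var2 = as.Date(c(NA, 18042, 16544, 16697, NA), origin = "1970-01-01"),
  var3 = as.Date(c(17845, 18070, 17152, 16571, NA), origin = "1970-01-01")
)

vars <- c("var1", "var2", "var3")  # hypothetical: the columns to compare

# Matrix of differences in days (value minus reference), one column per variable
d <- sapply(df1[vars], function(v) as.numeric(v - df1$reference_day))

# Hypothetical helper: for each row, name of the column with the largest
# non-missing entry, or NA when the whole row is missing
pick_col <- function(m) {
  apply(m, 1, function(r) {
    if (all(is.na(r))) NA_character_ else names(r)[which.max(r)]
  })
}

# Left: the negative difference closest to zero is the largest negative value
df1$closest_to_left <- pick_col(ifelse(d < 0, d, NA))
# Right: the smallest positive difference is the largest value after negation
df1$closest_to_right <- pick_col(ifelse(d > 0, -d, NA))

df1[c("id", "closest_to_left", "closest_to_right")]
```

A value that falls exactly on reference_day counts as neither left nor right here, which matches the strict comparisons used in the functions above.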
Here is a base R solution. There may be simpler ways.
nms <- names(df1[-(1:3)])
res <- apply(df1[-(1:2)], 1, \(x){
d <- difftime(x[1], x[-1])
if(any(!is.na(d))){
if(any(d > 0, na.rm = TRUE)) {
i <- which((d > 0) & (d == min(d[d > 0], na.rm = TRUE)))
closest_left <- nms[i]
} else closest_left <- NA
if(any(d < 0, na.rm = TRUE)){
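# d is reference minus value, so the nearest later date is the negative difference closest to zero (max, not min)
j <- which((d < 0) & (d == max(d[d < 0], na.rm = TRUE)))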
closest_right <- nms[j]
} else closest_right <- NA
c(closest_left = closest_left, closest_right = closest_right)
} else c(closest_left = NA, closest_right = NA)
})
res <- cbind(df1, t(res))
res
#> id gender reference_day var1 var2 var3 closest_left closest_right
#> 1 1 female 2019-06-05 2015-03-11 <NA> 2018-11-10 var3 <NA>
#> 2 2 male <NA> 2016-12-09 2019-05-26 2019-06-23 <NA> <NA>
#> 3 3 male 2019-06-05 2019-08-31 2015-04-19 2016-12-17 var3 var1
#> 4 4 male 2019-06-05 <NA> 2015-09-19 2015-05-16 var2 <NA>
#> 5 5 male 2019-06-05 2015-07-23 <NA> <NA> var1 <NA>
Created on 2022-02-06 by the reprex package (v2.0.1)
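One thing worth checking with this kind of code is the sign convention. When the difference is computed as reference minus value, dates after the reference give negative differences, and the nearest of them is the one closest to zero, that is the maximum of the negative values. The sample data has at most one later date per row, so it cannot show the difference. A small check with made-up dates (the names a, b, c are hypothetical):

```r
reference <- as.Date("2019-06-05")

# Hypothetical values: two dates after the reference and one before it
vals <- as.Date(c("2019-06-10", "2019-12-31", "2019-01-01"))
nms <- c("a", "b", "c")

# Reference minus value, in days: negative means the value lies after the reference
d <- as.numeric(difftime(reference, vals, units = "days"))

after <- d < 0
nearest_after  <- nms[which(after & d == max(d[after]))]  # closest to zero
farthest_after <- nms[which(after & d == min(d[after]))]  # most negative

nearest_after   # "a"
farthest_after  # "b"
```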
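Here is another tidyverse approach:
- First, compute the difference between each variable and the reference value
- Bring the data into long format
- Remove the diff_ prefix from the variable names
- Create a helper column that keeps only the negative values
- Group and arrange
- Identify the closest value to the left, then redefine the helper column so it keeps only the positive values and identify the closest value to the right
- Fill both columns and use slice to select the first row of each group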
library(tidyverse)

df1 %>%
mutate(across(contains("var"), ~ parse_number(as.character(. - reference_day)), .names = "diff_{.col}")) %>%
pivot_longer(cols = contains("diff")) %>%
mutate(name = str_remove(name, '\\w+_'),
helper = ifelse(value > 0, NA_real_, value)) %>%
group_by(id) %>%
arrange(desc(helper), .by_group = TRUE) %>%
mutate(closest_to_left = ifelse(helper == max(helper, na.rm = TRUE), name, NA_character_),
helper = ifelse(value < 0, NA_real_, value),
closest_to_right = ifelse(helper == min(helper, na.rm = TRUE), name, NA_character_)) %>%
fill(closest_to_left, .direction = "downup") %>%
fill(closest_to_right, .direction = "downup") %>%
slice(1) %>%
select(-c(name, value, helper))
id gender reference_day var1 var2 var3 closest_to_left closest_to_right
<int> <chr> <date> <date> <date> <date> <chr> <chr>
1 1 female 2019-06-05 2015-03-11 NA 2018-11-10 var3 NA
2 2 male NA 2016-12-09 2019-05-26 2019-06-23 NA NA
3 3 male 2019-06-05 2019-08-31 2015-04-19 2016-12-17 var3 var1
4 4 male 2019-06-05 NA 2015-09-19 2015-05-16 var2 NA
5 5 male 2019-06-05 2015-07-23 NA NA var1 NA
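Here is a relatively simple tidyverse approach. First, we define a function that picks the closest match before or after reference_day within each group, then we apply that function once for each case to add the two new columns. I use the side argument to define whether we want to match on the side with negative time differences (before) or positive ones (after).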
closest <- function(df, side = -1) {
df %>%
pivot_longer(-c(id:reference_day)) %>%
group_by(id, gender) %>%
arrange(value) %>%
mutate(dif = (value - reference_day) * side) %>%
filter(dif > 0) %>%
slice_min(dif) %>%
select(name) %>%
ungroup()
}
df1 %>%
left_join(df1 %>% closest(-1) %>% rename("left" = "name")) %>%
left_join(df1 %>% closest(1) %>% rename("right" = "name"))
Result
id gender reference_day var1 var2 var3 left right
1 1 female 2019-06-05 2015-03-11 <NA> 2018-11-10 var3 <NA>
2 2 male <NA> 2016-12-09 2019-05-26 2019-06-23 <NA> <NA>
3 3 male 2019-06-05 2019-08-31 2015-04-19 2016-12-17 var3 var1
4 4 male 2019-06-05 <NA> 2015-09-19 2015-05-16 var2 <NA>
5 5 male 2019-06-05 2015-07-23 <NA> <NA> var1 <NA>
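To see what the side multiplier does in isolation, here is a minimal sketch on a single row, using the dates from row 3 of the question. The function name closest_name is made up for illustration and is not part of the answer above. Multiplying the difference by -1 turns "before" into positive numbers, so the same "smallest positive difference" rule serves both directions.

```r
reference <- as.Date("2019-06-05")

# Row 3 of the question's data
values <- as.Date(c("2019-08-31", "2015-04-19", "2016-12-17"))
names(values) <- c("var1", "var2", "var3")

# Hypothetical helper: name of the value closest to `reference` on one side
# side = -1 looks before the reference, side = 1 looks after it
closest_name <- function(values, reference, side = -1) {
  dif <- as.numeric(values - reference) * side
  dif[which(dif <= 0)] <- NA               # drop values on the other side
  if (all(is.na(dif))) return(NA_character_)
  names(values)[which.min(dif)]            # smallest positive difference
}

closest_name(values, reference, side = -1)  # "var3"
closest_name(values, reference, side = 1)   # "var1"
```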