How to detect the closest value below and above a given reference variable in a data frame in R?

Consider the following random MWE.

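For each row, I am trying to identify which variable has the closest value below the constant reference_day and which variable has the closest value above the constant reference_day.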

df1 <-
  structure(
    list(id = 1:5,
      gender = c("female", "male", "male", "male", "male"),
      reference_day = structure(c(18052, NA, 18052, 18052, 18052), class = "Date"),
      var1 = structure(c(16505, 17144, 18139, NA, 16639), class = "Date"),
      var2 = structure(c(NA, 18042, 16544, 16697, NA), class = "Date"),
      var3 = structure(c(17845, 18070, 17152, 16571, NA), class = "Date")),
  row.names = c(NA, -5L), class = "data.frame")

df1

  id gender reference_day       var1       var2       var3
1  1 female    2019-06-05 2015-03-11       <NA> 2018-11-10
2  2   male          <NA> 2016-12-09 2019-05-26 2019-06-23
3  3   male    2019-06-05 2019-08-31 2015-04-19 2016-12-17
4  4   male    2019-06-05       <NA> 2015-09-19 2015-05-16
5  5   male    2019-06-05 2015-07-23       <NA>       <NA>

The result I want looks like this:

  id gender reference_day       var1       var2       var3 closest_to_left closest_to_right
1  1 female    2019-06-05 2015-03-11       <NA> 2018-11-10            var3             <NA>
2  2   male          <NA> 2016-12-09 2019-05-26 2019-06-23            <NA>             <NA>
3  3   male    2019-06-05 2019-08-31 2015-04-19 2016-12-17            var3             var1
4  4   male    2019-06-05       <NA> 2015-09-19 2015-05-16            var2             <NA>
5  5   male    2019-06-05 2015-07-23       <NA>       <NA>            var1             <NA>

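After a lot of trial and error, I was actually able to find a solution using dplyr's case_when function, but it requires a lot of boilerplate code, which makes me think there must be a smarter solution.

I personally prefer using dplyr, but any help is greatly appreciated.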

Custom functions that do this -

library(dplyr)

cols <- df1 %>% select(starts_with('var')) %>% names

closest_to_right <- function(x, y) {
  tmp <- y - x
  if(any(tmp > 0, na.rm = TRUE)) 
     cols[tmp %in% min(tmp[tmp > 0], na.rm = TRUE)] else NA
}

closest_to_left <- function(x, y) {
  tmp <- y - x
  if(any(tmp < 0, na.rm = TRUE)) 
     cols[tmp %in% max(tmp[tmp < 0], na.rm = TRUE)] else NA
}

df1 %>%
  rowwise() %>%
  mutate(closest_to_left = closest_to_left(reference_day, c_across(starts_with('var'))),
         closest_to_right = closest_to_right(reference_day, c_across(starts_with('var')))) %>%
  ungroup

#     id gender reference_day var1       var2       var3       closest_to_left closest_to_right
#  <int> <chr>  <date>        <date>     <date>     <date>     <chr>           <chr>           
#1     1 female 2019-06-05    2015-03-11 NA         2018-11-10 var3            NA              
#2     2 male   NA            2016-12-09 2019-05-26 2019-06-23 NA              NA              
#3     3 male   2019-06-05    2019-08-31 2015-04-19 2016-12-17 var3            var1            
#4     4 male   2019-06-05    NA         2015-09-19 2015-05-16 var2            NA              
#5     5 male   2019-06-05    2015-07-23 NA         NA         var1            NA        
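
One thing to watch with the functions above: if two variables are equally close on the same side, `cols[tmp %in% ...]` would return both names, and a rowwise `mutate()` expects a single value per row. Below is a minimal sketch of a tie-safe variant, assuming that taking the first match is acceptable. The helper name `closest_name` and its arguments are hypothetical and not part of the answer above.

```r
# Hypothetical helper: returns exactly one name, or NA if nothing lies on that side.
# d   : signed differences (variable minus reference), may contain NA
# nms : variable names, in the same order as d
closest_name <- function(d, nms, side = c("left", "right")) {
  side <- match.arg(side)
  # which() silently drops NA comparisons, so missing dates are ignored
  idx <- if (side == "left") which(d < 0) else which(d > 0)
  if (length(idx) == 0) return(NA_character_)
  # smallest absolute difference wins; which.min() keeps the first on ties
  nms[idx[which.min(abs(d[idx]))]]
}

# Row 3 of df1, written out by hand
ref  <- as.Date("2019-06-05")
vals <- as.Date(c("2019-08-31", "2015-04-19", "2016-12-17"))
nms  <- c("var1", "var2", "var3")
d    <- as.numeric(vals - ref)   # differences in days

left  <- closest_name(d, nms, "left")
right <- closest_name(d, nms, "right")
```

Because the helper always returns a length-one character value, it should slot into the same `rowwise()` / `c_across()` call in place of the two separate functions.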

Here is a base R solution. There may be simpler ways.

nms <- names(df1[-(1:3)])
res <- apply(df1[-(1:2)], 1, \(x){
  d <- difftime(x[1], x[-1])
  if(any(!is.na(d))){
    if(any(d > 0, na.rm = TRUE)) {
      i <- which((d > 0) & (d == min(d[d > 0], na.rm = TRUE)))
      closest_left <- nms[i]
    } else closest_left <- NA
    if(any(d < 0, na.rm = TRUE)){
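      # d is reference minus variable, so the closest date on the right is the
      # negative difference nearest to zero, i.e. the maximum, not the minimum
      j <- which((d < 0) & (d == max(d[d < 0], na.rm = TRUE)))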
      closest_right <- nms[j]
    } else closest_right <- NA
    c(closest_left = closest_left, closest_right = closest_right)
  } else c(closest_left = NA, closest_right = NA)
})

res <- cbind(df1, t(res))
res
#>   id gender reference_day       var1       var2       var3 closest_left closest_right
#> 1  1 female    2019-06-05 2015-03-11       <NA> 2018-11-10         var3          <NA>
#> 2  2   male          <NA> 2016-12-09 2019-05-26 2019-06-23         <NA>          <NA>
#> 3  3   male    2019-06-05 2019-08-31 2015-04-19 2016-12-17         var3          var1
#> 4  4   male    2019-06-05       <NA> 2015-09-19 2015-05-16         var2          <NA>
#> 5  5   male    2019-06-05 2015-07-23       <NA>       <NA>         var1          <NA>

Created on 2022-02-06 by the reprex package (v2.0.1)
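
For comparison, here is a rough sketch of a vectorised base R variant that works on a numeric matrix of day differences instead of looping over character rows with `apply()`. It is only an illustration under the assumption that the first match is acceptable on ties. The small data frame `df` and the helper `pick` are made up for this example (three rows copied from df1).

```r
# Three rows copied by hand from df1 (ids 1 to 3)
df <- data.frame(
  id = 1:3,
  reference_day = as.Date(c("2019-06-05", NA, "2019-06-05")),
  var1 = as.Date(c("2015-03-11", "2016-12-09", "2019-08-31")),
  var2 = as.Date(c(NA, "2019-05-26", "2015-04-19")),
  var3 = as.Date(c("2018-11-10", "2019-06-23", "2016-12-17"))
)
vars <- c("var1", "var2", "var3")

# Signed day differences (variable minus reference), one column per variable
d <- sapply(df[vars], function(v) as.numeric(v - df$reference_day))

# Hypothetical helper: per row, name of the column with the largest score,
# or NA when every score in the row is missing
pick <- function(m) {
  apply(m, 1, function(r) {
    if (all(is.na(r))) NA_character_ else vars[which.max(r)]
  })
}

left  <- pick(ifelse(d < 0, d, NA))    # negative difference nearest to zero
right <- pick(ifelse(d > 0, -d, NA))   # positive difference nearest to zero

res <- cbind(df, closest_left = left, closest_right = right)
```

Negating the positive differences lets one helper serve both sides, since "closest" always means "nearest to zero".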

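Here is another tidyverse approach:

  1. First, compute the difference between each variable and the reference.
  2. Pivot to long format.
  3. Remove the diff_ prefix from the variable names.
  4. Create a helper column that keeps only the negative values.
  5. Group and arrange.
  6. Identify closest_to_left, then redefine the helper column so that it keeps only the positive values and identify closest_to_right.
  7. Fill the two closest columns.
  8. Slice the first row of each group.
  9. Select the columns to keep.

library(tidyverse)

df1 %>% 
  # signed difference in days between each var and the reference
  mutate(across(contains("var"), ~ parse_number(as.character(. - reference_day)), .names = "diff_{.col}")) %>% 
  pivot_longer(cols = contains("diff")) %>% 
  # strip the diff_ prefix; helper keeps only the negative differences
  mutate(name = str_remove(name, "^diff_"),
         helper = ifelse(value > 0, NA_real_, value)) %>% 
  group_by(id) %>% 
  arrange(desc(helper), .by_group = TRUE) %>% 
  mutate(closest_to_left = ifelse(helper == max(helper, na.rm = TRUE), name, NA_character_),
         # helper now keeps only the positive differences
         helper = ifelse(value < 0, NA_real_, value),
         closest_to_right = ifelse(helper == min(helper, na.rm = TRUE), name, NA_character_)) %>% 
  fill(closest_to_left, .direction = "downup") %>% 
  fill(closest_to_right, .direction = "downup") %>% 
  slice(1) %>% 
  select(-c(name, value, helper))

     id gender reference_day var1       var2       var3       closest_to_left closest_to_right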
  <int> <chr>  <date>        <date>     <date>     <date>     <chr>           <chr>           
1     1 female 2019-06-05    2015-03-11 NA         2018-11-10 var3            NA              
2     2 male   NA            2016-12-09 2019-05-26 2019-06-23 NA              NA              
3     3 male   2019-06-05    2019-08-31 2015-04-19 2016-12-17 var3            var1            
4     4 male   2019-06-05    NA         2015-09-19 2015-05-16 var2            NA              
5     5 male   2019-06-05    2015-07-23 NA         NA         var1            NA  

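Here is a relatively simple tidyverse approach. First, we define a function that picks the closest match before or after reference_day within each group, then we apply that function once per side to add two new columns. I use the side argument to say whether we want the match on the side with a negative time difference (before) or a positive one (after).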

library(dplyr)
library(tidyr)

closest <- function(df, side = -1) {
  df %>%
    pivot_longer(-c(id:reference_day)) %>%
    group_by(id, gender) %>%
    arrange(value) %>%
    mutate(dif = (value - reference_day) * side) %>%
    filter(dif > 0) %>%
    slice_min(dif) %>%
    select(name) %>%
    ungroup()
}

df1 %>%
  left_join(df1 %>% closest(-1) %>% rename("left" = "name")) %>%
  left_join(df1 %>% closest(1) %>% rename("right" = "name"))

Result

  id gender reference_day       var1       var2       var3 left right
1  1 female    2019-06-05 2015-03-11       <NA> 2018-11-10 var3  <NA>
2  2   male          <NA> 2016-12-09 2019-05-26 2019-06-23 <NA>  <NA>
3  3   male    2019-06-05 2019-08-31 2015-04-19 2016-12-17 var3  var1
4  4   male    2019-06-05       <NA> 2015-09-19 2015-05-16 var2  <NA>
5  5   male    2019-06-05 2015-07-23       <NA>       <NA> var1  <NA>
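
If calling the function twice and joining twice feels heavy, the same idea can probably be done in one pass by labelling each difference with its side and then spreading the sides into columns. This is only a sketch, assuming dplyr (1.0 or later) and tidyr are installed. The data frame `df` holds three rows copied from df1, and the names `closest_wide` and `result` are made up for the example.

```r
library(dplyr)
library(tidyr)

# Three rows copied by hand from df1 (ids 1 to 3)
df <- data.frame(
  id = 1:3,
  reference_day = as.Date(c("2019-06-05", NA, "2019-06-05")),
  var1 = as.Date(c("2015-03-11", "2016-12-09", "2019-08-31")),
  var2 = as.Date(c(NA, "2019-05-26", "2015-04-19")),
  var3 = as.Date(c("2018-11-10", "2019-06-23", "2016-12-17"))
)

closest_wide <- df %>%
  pivot_longer(starts_with("var")) %>%
  mutate(dif  = as.numeric(value - reference_day),
         # NA or zero differences get no side and are dropped below
         side = case_when(dif < 0 ~ "left", dif > 0 ~ "right")) %>%
  filter(!is.na(side)) %>%
  group_by(id, side) %>%
  # keep the single nearest variable per id and side
  slice_min(abs(dif), n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(id, side, name) %>%
  pivot_wider(names_from = side, values_from = name)

# ids with no usable difference (here id 2) come back as NA after the join
result <- left_join(df, closest_wide, by = "id")
```

One caveat: if no row in the data has a value on one side at all, `pivot_wider()` will not create that column, so it may need to be added by hand.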