Correct syntax for mutate across when excluding columns, part 2

Question

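I thought I had found the answer to my question, but when I used a larger dataset I got different results. I suspect the difference comes from how the na.locf line behaves.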

Basically, I am converting code that uses mutate_at to the newer mutate(across()) syntax.

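In the first case below, the data is filled correctly because df_initial is still grouped by index_name. In the second case, I assume I get a different answer because I had to ungroup to make mutate across work, so na.locf carries values forward across different indices.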
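To make the suspected difference concrete, here is a minimal sketch on made-up data. The tibble `toy` and its columns `grp` and `value` are hypothetical and only stand in for df_initial; the point is just how na.locf behaves with and without grouping.

```r
library(dplyr)
library(zoo)

# Hypothetical toy data: two groups interleaved, with gaps in `value`.
toy <- tibble(
  grp   = c("a", "b", "a", "b", "a", "b"),
  value = c(1, NA, NA, 20, NA, NA)
)

# Ungrouped: the fill walks down the whole column,
# so a value from one group leaks into the other.
ungrouped <- toy %>%
  mutate(value = na.locf(value, na.rm = FALSE))

# Grouped: the fill runs separately within each `grp`,
# so a leading NA in a group stays NA.
grouped <- toy %>%
  group_by(grp) %>%
  mutate(value = na.locf(value, na.rm = FALSE)) %>%
  ungroup()

ungrouped$value  # 1  1  1 20 20 20
grouped$value    # 1 NA  1 20  1 20
```

The second row shows the effect: ungrouped, group "b" inherits the 1 from group "a"; grouped, it stays NA because "b" has no earlier value of its own.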

So here is another example with a larger dataset to illustrate the problem.

Reproducible example:

df_initial <- 
structure(list(Date = structure(c(18681, 18681, 18681, 18681, 
  18682, 18682, 18682, 18682, 18683, 18683, 18683, 18683, 18684, 
  18684, 18684, 18684, 18685, 18685, 18685, 18685, 18686, 18686, 
  18686, 18686), class = "Date"), index_name = c("INDU Index", 
  "SPX Index", "TPX Index", "MEXBOL Index", "INDU Index", "SPX Index", 
  "TPX Index", "MEXBOL Index", "INDU Index", "SPX Index", "TPX Index", 
  "MEXBOL Index", "INDU Index", "SPX Index", "TPX Index", "MEXBOL Index", 
  "INDU Index", "SPX Index", "TPX Index", "MEXBOL Index", "INDU Index", 
  "SPX Index", "TPX Index", "MEXBOL Index"), index_level = c(31537.35, 
  3881.37, NA, 45268.33, 31961.86, 3925.43, 1903.07, 45151.38, 
  31402.01, 3829.34, 1926.23, 44310.27, 30932.37, 3811.15, 1864.49, 
  44592.91, NA, NA, NA, NA, NA, NA, NA, NA), totalReturn_daily = c(0.0497, 
  0.1277, 0, 0.7158, 1.3461, 1.1364, -1.8201, -0.1151, -1.7181, 
  -2.4339, 1.2411, -1.8629, -1.4628, -0.4636, -3.2052, 0.6379, 
  0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, -24L), groups = structure(list(
  index_name = c("INDU Index", "MEXBOL Index", "SPX Index", 
  "TPX Index"), .rows = structure(list(c(1L, 5L, 9L, 13L, 17L, 
  21L), c(4L, 8L, 12L, 16L, 20L, 24L), c(2L, 6L, 10L, 14L, 
  18L, 22L), c(3L, 7L, 11L, 15L, 19L, 23L)), ptype = integer(0), class = c("vctrs_list_of", 
  "vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df", 
  "tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
  "tbl_df", "tbl", "data.frame"))

The first approach below gives the correct values, but the second does not. I am trying to get the same answer from approach #2 that I get from approach #1.

# Approach 1: Expected output received here:
df_initial %>%
  mutate_at(vars(-index_name, -totalReturn_daily),
            ~ na.locf(., na.rm = FALSE)) %>%
  filter(index_name == "TPX Index")

# Output
  Date       index_name index_level totalReturn_daily
  <date>     <chr>            <dbl>             <dbl>
1 2021-02-23 TPX Index          NA               0   
2 2021-02-24 TPX Index        1903.             -1.82
3 2021-02-25 TPX Index        1926.              1.24
4 2021-02-26 TPX Index        1864.             -3.21
5 2021-02-27 TPX Index        1864.              0   
6 2021-02-28 TPX Index        1864.              0  

# Approach 2: Did not receive expected output here
df_initial %>%
  ungroup() %>%
  mutate(across(
    .cols = -c(index_name, totalReturn_daily),
    .fns  = ~ na.locf(., na.rm = FALSE)
  )) %>%
  filter(index_name == "TPX Index")

# Output
  Date       index_name index_level totalReturn_daily
  <date>     <chr>            <dbl>             <dbl>
1 2021-02-23 TPX Index        3881.              0   
2 2021-02-24 TPX Index        1903.             -1.82
3 2021-02-25 TPX Index        1926.              1.24
4 2021-02-26 TPX Index        1864.             -3.21
5 2021-02-27 TPX Index       44593.              0   
6 2021-02-28 TPX Index       44593.              0

Thanks!

Answer 1

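Both approaches give me the same result once the data stays grouped by index_name. Inside a grouped data frame, across() leaves the grouping columns out of its selection automatically, so only totalReturn_daily has to be excluded and there is no need to ungroup before the mutate. Could you try the code below?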

library(dplyr)
library(zoo)

# mutate_at(): fill forward within each index_name group
df_initial %>%
  group_by(index_name) %>% 
  mutate_at(vars(-index_name, -totalReturn_daily),
            ~ na.locf(., na.rm = FALSE)) %>% 
  dplyr::filter(index_name == "TPX Index") 

# mutate(across()): stay grouped while filling, ungroup only afterwards
df_initial %>%
  group_by(index_name) %>% 
  mutate(across(
    .cols = -c(totalReturn_daily),
    .fns  = ~ na.locf(., na.rm = FALSE)
  )) %>%
  ungroup() %>% 
  dplyr::filter(index_name == "TPX Index")
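If it helps to check the idea on something smaller, here is a rough sketch on made-up data. The tibble `toy` and its columns `grp`, `level` and `ret` are hypothetical stand-ins for df_initial. It also shows tidyr::fill() as a possible alternative to na.locf; that swap is my suggestion, not something taken from the question.

```r
library(dplyr)
library(tidyr)
library(zoo)

# Hypothetical toy data, grouped the same way df_initial is.
toy <- tibble(
  grp   = c("a", "b", "a", "b"),
  level = c(NA, 5, 2, NA),
  ret   = c(0, NA, 0.1, 0.2)
) %>%
  group_by(grp)

# across() skips the grouping column `grp` on its own,
# so only `ret` needs to be excluded by hand.
via_across <- toy %>%
  mutate(across(-ret, ~ na.locf(.x, na.rm = FALSE))) %>%
  ungroup()

# Alternative: tidyr::fill() also respects the grouping.
via_fill <- toy %>%
  fill(level, .direction = "down") %>%
  ungroup()

via_across$level  # NA  5  2  5
all.equal(via_across, via_fill)
```

In both versions `ret` is left untouched (its NA stays in place) and the leading NA in group "a" stays NA, which matches the behaviour of approach 1 in the question.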

