Apply function on data.frame with mutate across using the same columns from another data.frame

Question

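I have two data frames of spectral bands from a satellite, redDF and nirDF. Both data frames hold one column of values per date, each column name starting with 'X', and these names correspond between the two data frames. I want to get a new data frame in which, for every column starting with 'X' in redDF and nirDF, a new value is calculated according to some formula.

Here is a data sample: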

library(dplyr)
set.seed(999)
# get column names
datecolnames <- seq(as.Date("2015-05-01", "%Y-%m-%d"),
           as.Date("2015-09-20", "%Y-%m-%d"),
           by="16 days") %>% 
  format(., "%Y-%m-%d") %>% 
  paste0("X", .)
# sample data values 
mydata <- as.integer(runif(length(datecolnames))*1000)
# sample no data indices
nodata <- sample(1:length(datecolnames), length(datecolnames)*0.3)
mydata[nodata] <- NA # assign no data to the correct indices

# get dummy data.frame of red spectral values
redDF <- data.frame(mydata,
           mydata[sample(1:length(mydata))],
           mydata[sample(1:length(mydata))]) %>% 
  t() %>% 
  as.data.frame(., row.names = FALSE) %>% 
  rename_with(~datecolnames) %>% 
  mutate(id = row_number()+1142) %>% 
  select(id, everything())

# get dummy data.frame of near infrared spectral values
# in this case a modified version of redDF
nirDF <- redDF %>% 
  mutate(across(-id,~as.integer(.x+20*1.8))) %>% 
  select(id, everything())

> nirDF
    id X2015-05-01 X2015-05-17 X2015-06-02 X2015-06-18 X2015-07-04 X2015-07-20 X2015-08-05
1 1143          NA         645          NA         636         569         841         706
2 1144        1025          NA         706         569         354          NA          NA
3 1145         904         636         706         645          NA          NA         115
  X2015-08-21 X2015-09-06 X2015-09-22 X2015-10-08 X2015-10-24 X2015-11-09
1         115        1025         904          NA         409         354
2         115         636         409         645         841         904
3         569         409         354         841        1025          NA

Here is the formula:

getNDVI <- function(red, nir){round((nir - red)/(nir + red), digits = 4)}

I was hoping I could do something like:

ndviDF <- redDF %>% mutate(across(starts_with('X'), .fns = getNDVI))

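But this does not work, because dplyr does not know what the nir argument of getNDVI should be. I have seen solutions that access the other data frame inside mutate() with the $COLNAME indexer, but since I have 197 columns that is not an option.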

Answer 1

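I would approach this with a for loop, even though I know it does not take full advantage of features like across.

First, we create a list of the columns to iterate over: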

cols_to_iterate_over = redDF %>%
  select(starts_with("X")) %>%
  colnames()

Then we join on id and make sure the columns are named after their source data set:

joined_df = redDF %>%
  inner_join(nirDF, by = "id", suffix = c("_red","_nir"))

So joined_df should have columns like this:

id X2015-05-01_red X2015-05-01_nir X2015-05-17_red X2015-05-17_nir ...

Then we can iterate over these:

for(col in cols_to_iterate_over){
  # columns for calculation
  red_col = paste0(col,"_red") %>% sym()
  nir_col = paste0(col,"_nir") %>% sym()
  out_col = col %>% sym()
  
  # calculate
  joined_df = joined_df %>%
    mutate(
      !!out_col := round((!!nir_col - !!red_col)/(!!nir_col + !!red_col),
                         digits = 4)
    ) %>%
    select(-!!red_col, -!!nir_col)
}

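Explanation: if we turn text strings into symbols and then !! them, we can use the text strings as variable names.

sym() turns text into a symbol,
!! inside a dplyr command turns the symbol into code,
and := is equivalent to = but allows us to have !! on the left-hand side.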

Apologies, this is somewhat dated syntax. For the current approach, see Programming with dplyr.
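As a rough illustration of what the same loop might look like in the newer style, here is a minimal sketch that swaps sym() and !! for the .data pronoun and a glue-style name on the left of :=. The two-row data frames red_demo and nir_demo, and the names date_cols and joined_demo, are made up for this example and are not part of the original data.

```r
library(dplyr)

# Hypothetical stand-ins for redDF and nirDF (values invented for illustration)
red_demo <- data.frame(id = 1:2,
                       `X2015-05-01` = c(100, 200),
                       `X2015-05-17` = c(300, NA),
                       check.names = FALSE)
nir_demo <- data.frame(id = 1:2,
                       `X2015-05-01` = c(300, 600),
                       `X2015-05-17` = c(500, 400),
                       check.names = FALSE)

# Names of the date columns shared by both data frames
date_cols <- red_demo %>% select(starts_with("X")) %>% colnames()

# Join on id; duplicated column names get the _red / _nir suffixes
joined_demo <- inner_join(red_demo, nir_demo, by = "id",
                          suffix = c("_red", "_nir"))

for (col in date_cols) {
  red_name <- paste0(col, "_red")
  nir_name <- paste0(col, "_nir")
  joined_demo <- joined_demo %>%
    # "{col}" := names the output column after the string stored in col;
    # .data[[name]] looks a column up by its name given as a string
    mutate("{col}" := round(
      (.data[[nir_name]] - .data[[red_name]]) /
        (.data[[nir_name]] + .data[[red_name]]),
      digits = 4)) %>%
    # drop the two source columns once the index has been computed
    select(-all_of(c(red_name, nir_name)))
}

joined_demo
```

Missing values simply propagate: the NA in red_demo yields an NA in the corresponding output cell.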

Answer 2

In its most basic form, you can do this:

round((nirDF - redDF)/(nirDF + redDF), digits = 4)

But this does not preserve the id-column, and it may break if some columns are not numeric. A safer version is:

red <- redDF %>% 
  arrange(id) %>%  # be sure to apply the same order everywhere
  select(starts_with('X')) %>%  
  mutate(across(everything(), as.numeric)) # be sure to have numeric columns 
nir <- nirDF %>% arrange(id) %>% 
  select(starts_with('X')) %>%  
  mutate(across(everything(), as.numeric))

# make sure that the number of rows are equal
if(nrow(red) == nrow(nir)){
  ndvi <- redDF %>% 
    # get data.frame with ndvi values
    transmute(round((nir - red)/(nir + red), digits = 4)) %>% 
    # bind id-column and possibly other columns to the data frame
    bind_cols(redDF %>% arrange(id) %>% select(!starts_with('X'))) %>% 
    # place the id-column to the front
    select(!starts_with('X'), everything())
}

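As far as I currently understand dplyr, it boils down to this:

across is (usually) used for many-to-many relationships, but by default it handles columns individually. So if you give it three columns, it gives you back three columns, and these columns know nothing about the values in the other columns.
c_across can evaluate relationships between columns (such as a sum or a standard deviation), but it is meant for many-to-one relationships. In other words, if you give it three columns, it gives you back one column.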
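To make the distinction concrete, here is a small sketch on a made-up three-column data frame (demo and its columns x, y, z are invented for this example): across() returns one transformed column per input column, while c_across() inside rowwise() collapses the selected columns into a single value per row.

```r
library(dplyr)

# Hypothetical data, only for illustration
demo <- data.frame(x = c(1, 2), y = c(10, 20), z = c(100, 200))

# across(): every column is transformed on its own,
# three columns in, three columns out
doubled <- demo %>%
  mutate(across(everything(), ~ .x * 2))

# c_across(): within each row the selected columns are combined
# into one vector, so three columns in, one new column out
row_totals <- demo %>%
  rowwise() %>%
  mutate(total = sum(c_across(x:z))) %>%
  ungroup()

doubled
row_totals
```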

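Neither of these fits this task. However, by design, arithmetic operations can be applied to data frames in R (just try cars*cars, for example). That is what we need in this case. Fortunately, these operations are not as greedy as dplyr join operations, so they can be done efficiently on large data frames. When doing so, there are a few requirements to keep in mind: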

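Both data frames must have the same number of rows (and columns): base R raises an error for arithmetic on data frames of different sizes rather than recycling the shorter one.
All columns in the data frames must be of a numeric class (numeric or integer).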
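The requirements above can be checked on a toy example. This is a minimal sketch with invented values (red_demo, nir_demo, ndvi_demo and size_error are names made up here); it uses base R only.

```r
# Hypothetical numeric data frames of equal size
red_demo <- data.frame(a = c(1, 2), b = c(3, 4))
nir_demo <- data.frame(a = c(3, 6), b = c(5, 12))

# Element-wise arithmetic on whole data frames
ndvi_demo <- round((nir_demo - red_demo) / (nir_demo + red_demo), digits = 4)
ndvi_demo

# Dropping a row from one side makes the sizes differ;
# the subtraction then fails, and we capture the error message
size_error <- tryCatch(
  nir_demo - red_demo[1, , drop = FALSE],
  error = function(e) conditionMessage(e)
)
size_error
```

Note that, as far as I can tell, data-frame arithmetic pairs columns by position rather than by name, which is why the safer version sorts both data frames the same way and selects the same columns before computing.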

r

dplyr

across