When joining or merging data frames, what's the best way to deal with a field that has "no value" to indicate that all values are possible?

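An example is provided below. Is there a way to handle this in R to get the desired outcome for main_df_2? Or, is there a way to add rows to lookup_df so that the missing possibilities for (division = 'd2'), namely (parent = 'P') and (parent = 'C'), are added as two rows in addition to the one blank row?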

# example for the type of join 
library(tidyverse)
main_df <- data.frame( division = c('d1', 'd1', 'd2', 'd2'),
                       parent = c('P', 'C', 'P', 'C'))
main_df
# > main_df
# division parent 
# 1       d1      P
# 2       d1      C       
# 3       d2      P
# 4       d2      C
lookup_df <- data.frame( division = c('d1', 'd1', 'd2'),
                  parent = c('P', 'C', ''),
                  plant = c('A', 'B', 'B'))
lookup_df
# > lookup_df
# division parent plant
# 1       d1      P     A
# 2       d1      C     B
# 3       d2            B

# desired outcome 
# > main_df_2
# division parent plant
#    d1      P      A
#    d1      C      B
#    d2      P      B
#    d2      C      B

main_df_2 <- left_join(main_df, lookup_df,
                by = c("division" = "division",
                       "parent" = "parent"))

main_df_2
# > main_df_2
# division parent plant
# 1       d1      P     A
# 2       d1      C     B
# 3       d2      P  <NA>
# 4       d2      C  <NA>

Here are 2 options using data.table:

1) Split into 2 separate joins, then row-bind the results:

library(data.table)
setDT(main_df)
setDT(lookup_df)

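rbindlist(list(
    # exact matches on both division and parent
    main_df[lookup_df, on=.(division, parent), nomatch=0L],
    # wildcard rows (parent == ""): match on division only, keep parent from main_df
    main_df[lookup_df[parent==""], on=.(division), nomatch=0L, 
        .(division=x.division, parent=x.parent, plant=i.plant)]))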

2) Merge with a full outer join on division, then filter (this needs more memory if the joined dataset is large):

setnames(
    merge(lookup_df, main_df, by="division", all=TRUE)[
        parent.x==parent.y | parent.x==""][, 
            parent.x := NULL],
    "parent.y", "parent")

Output:

   division parent plant
1:       d1      P     A
2:       d1      C     B
3:       d2      P     B
4:       d2      C     B
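
Since the question uses the tidyverse and also asks about adding the missing rows to lookup_df, here is a minimal dplyr sketch of that second idea. It is not the data.table method above but a swapped-in alternative: expand every wildcard row into one row per parent value seen in main_df, then do an ordinary left_join. The names wildcards, expanded, explicit and lookup_full are hypothetical, and the sketch assumes that an empty string always means "any parent" and that an explicit (division, parent) row should win over a wildcard row.

```r
library(dplyr)

main_df <- data.frame(division = c("d1", "d1", "d2", "d2"),
                      parent   = c("P", "C", "P", "C"),
                      stringsAsFactors = FALSE)
lookup_df <- data.frame(division = c("d1", "d1", "d2"),
                        parent   = c("P", "C", ""),
                        plant    = c("A", "B", "B"),
                        stringsAsFactors = FALSE)

# Assumption: parent == "" marks a wildcard row that applies to any parent.
wildcards <- lookup_df %>%
  filter(parent == "") %>%
  select(-parent)

# Expand each wildcard row to every parent value that main_df has for that division.
expanded <- main_df %>%
  distinct(division, parent) %>%
  inner_join(wildcards, by = "division")

# Assumption: explicit rows take precedence, so drop expanded rows
# whose (division, parent) key already exists explicitly.
explicit <- lookup_df %>% filter(parent != "")
lookup_full <- bind_rows(
  explicit,
  anti_join(expanded, explicit, by = c("division", "parent"))
)

# With the lookup fully expanded, a plain left join is enough.
main_df_2 <- left_join(main_df, lookup_full, by = c("division", "parent"))
main_df_2
```

One design difference to be aware of: if a division had both an explicit row and a wildcard row, the data.table options above would return both matches, whereas this sketch keeps only the explicit one. Remove the anti_join step if both should be kept.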