R conditional join

Is there a way to join tables and update a column in R? Example:

library(tibble)

tbl1 <- tibble(ID = LETTERS[1:3],
               VAL = rep(NA, 3),
               tbl1_df = list(tibble(A = rnorm(3),
                                     B = rnorm(3))))

tbl2 <- tibble(ID = LETTERS[1:3],
               VAL = c(1, 2, 3),
               tbl2_df = list(tibble(A = rnorm(3),
                                     B = rnorm(3))))

tbl3 <- tibble(ID = LETTERS[1:3],
               VAL = c(1, 2, 3),
               tbl3_df = list(tibble(A = rnorm(3),
                                     B = rnorm(3))))

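I want to join these tibbles together and fill in VAL from whichever table has the values. Wherever VAL is populated, the tables always hold the same values, but I don't always know which table that is. Is it possible to coalesce the VAL columns together, or to take the VAL column from whichever tibble has the values?

The answer should look like the following. As stated, it doesn't matter which table the VAL column comes from, because each table has either the same VAL or NA.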

tibble(ID = LETTERS[1:3],
       VAL = c(1, 2, 3),
       tbl1_df = list(tibble(A = rnorm(3),
                             B = rnorm(3))),
       tbl2_df = list(tibble(A = rnorm(3),
                             B = rnorm(3))),
       tbl3_df = list(tibble(A = rnorm(3),
                             B = rnorm(3))))

# A tibble: 3 x 5
  ID      VAL tbl1_df          tbl2_df          tbl3_df         
  <chr> <dbl> <list>           <list>           <list>          
1 A        1. <tibble [3 x 2]> <tibble [3 x 2]> <tibble [3 x 2]>
2 B        2. <tibble [3 x 2]> <tibble [3 x 2]> <tibble [3 x 2]>
3 C        3. <tibble [3 x 2]> <tibble [3 x 2]> <tibble [3 x 2]>

How about this?

library(dplyr)
library(purrr)

list(tbl1, tbl2, tbl3) %>%
  reduce(full_join, by = "ID") %>%    # merge all tables on ID
  select_if(~ !all(is.na(.))) %>%     # drop columns that are entirely NA
  select(-starts_with("VAL."))        # drop the suffixed VAL.x / VAL.y, keeping the plain VAL
# Note: the plain VAL column comes from tbl3 here, so this relies on that table holding the values.

which gives

# A tibble: 3 x 5
  ID    tbl1_df          tbl2_df            VAL tbl3_df         
  <chr> <list>           <list>           <dbl> <list>          
1 A     <tibble [3 x 2]> <tibble [3 x 2]>  1.00 <tibble [3 x 2]>
2 B     <tibble [3 x 2]> <tibble [3 x 2]>  2.00 <tibble [3 x 2]>
3 C     <tibble [3 x 2]> <tibble [3 x 2]>  3.00 <tibble [3 x 2]>

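Building on Jaap's comment, you can use purrr's reduce together with dplyr's full_join to combine the tibbles into a single tibble. The question then is how to end up with just the VAL that exists, rather than three VAL columns of which not all contain data. A simple way is dplyr's coalesce, which takes the first non-missing value at each position. One issue this step introduces is that a column consisting only of NA has type logical, so wrapping each column in as.numeric resolves the type mismatch. Finally, drop the other VAL columns, the ones the join gave a suffix letter.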
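As a small illustration of the coalesce step in isolation, here is a sketch on plain vectors. The vector names (all_na, partial, full) are made up for this example and do not appear in the question.

```r
library(dplyr)

all_na  <- c(NA, NA, NA)   # logical, like tbl1$VAL in the question
partial <- c(1, NA, 3)     # numeric with a gap
full    <- c(1, 2, 3)      # numeric, fully populated

# Casting the all-NA vector to numeric gives every input the same type,
# then coalesce picks the first non-missing value at each position.
val <- coalesce(as.numeric(all_na), partial, full)
val
```

Position 2 is missing in the first two vectors, so it is filled from the third.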

library(dplyr)
library(purrr)

# Combine the tibbles into a single tibble
reduce(list(tbl1, tbl2, tbl3), full_join, by = "ID") %>%
  # Take the first non-missing VAL; as.numeric avoids a type clash with all-NA (logical) columns
  mutate(VAL = coalesce(as.numeric(VAL.x), as.numeric(VAL.y), as.numeric(VAL))) %>%
  # Drop the suffixed VAL.x / VAL.y columns created by the joins
  select(-starts_with("VAL."))
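If the number of tables varies, the same idea can be written without hard-coding VAL.x and VAL.y. The following is only a sketch under some assumptions: join_and_coalesce and rename_val are hypothetical helper names, every table is assumed to have an ID column and a VAL column that is numeric or all NA, and no other column name starts with "VAL_". The demo tables are simplified stand-ins for the ones in the question, with the values deliberately placed only in the middle table.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical helper: rename VAL to VAL_<i> so the join never has to invent suffixes
rename_val <- function(tbl, i) {
  names(tbl)[names(tbl) == "VAL"] <- paste0("VAL_", i)
  tbl
}

# Hypothetical helper: full-join any number of tibbles on ID and collapse their VAL columns
join_and_coalesce <- function(tbls) {
  renamed <- map2(tbls, seq_along(tbls), rename_val)
  joined  <- reduce(renamed, full_join, by = "ID")

  # Cast every VAL_<i> column to numeric so all-NA (logical) columns can be combined
  val_cols <- map(select(joined, starts_with("VAL_")), as.numeric)

  # First non-missing value across all VAL_<i> columns, row by row
  joined$VAL <- do.call(coalesce, unname(val_cols))

  joined %>%
    select(-starts_with("VAL_")) %>%
    select(ID, VAL, everything())
}

# Simplified demo tables: only the middle one carries the values
tbl1 <- tibble(ID = LETTERS[1:3], VAL = rep(NA, 3),  tbl1_df = list(tibble(A = 1:3)))
tbl2 <- tibble(ID = LETTERS[1:3], VAL = c(1, 2, 3),  tbl2_df = list(tibble(A = 1:3)))
tbl3 <- tibble(ID = LETTERS[1:3], VAL = rep(NA, 3),  tbl3_df = list(tibble(A = 1:3)))

res <- join_and_coalesce(list(tbl1, tbl2, tbl3))
res
```

Renaming before the join sidesteps any dependence on how full_join suffixes repeated column names, and the result does not depend on which table happens to hold the values.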