Detect mismatches in two data frames based on a specific column

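I have to work with rating data frames from two independent raters. Column x holds the coded publication year of the paper with a given reference ID (Ref.ID). For some papers, multiple samples were coded. This is reflected in the variable "Sample.ID" (e.g., in df1 three samples were coded for Ref.ID "C"). The combination of reference ID and sample ID is stored in the variable "Ref.Sample.ID". I want to know for which Ref.Sample.ID values the coding of variable x differs between df1 and df2. Note that df2 has one row less than df1, because the rater in df2 coded only two samples for Ref.ID "C", whereas the rater in df1 coded three.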

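I am trying to find R code that exposes the mismatches between df1 and df2. A mismatch can occur either because the number of rows coded per Ref.ID differs, or because x differs between df1 and df2 for the same Ref.Sample.ID.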

Does anyone know the best way to do this? I am grateful for any hint :)

df1 <- read.table(text="
  Ref.ID    Sample.ID    Ref.Sample.ID     x       y
  A         1            A-1               2000    a    
  B         1            B-1               1992    a
  C         1            C-1               2018    b 
  C         2            C-2               2018    b   
  C         3            C-3               2018    b   
  D         1            D-1               2011    c 
  D         1            D-1               2011    c
  E         1            E-1               1990    a      
  F         1            F-1               1990    c   
  G         1            G-1               2015    d   
  G         2            G-2               2015    d    
  G         3            G-3               2015    d", header=TRUE)

# Note df2 has one row less than df1!

df2 <- read.table(text="
  Ref.ID    Sample.ID    Ref.Sample.ID     x       y     
  A         1            A-1               2000    a   
  B         1            B-1               1992    a
  C         1            C-1               2018    b
  C         2            C-2               2018    b   
  D         1            D-1               2011    a 
  D         2            D-2               2011    a
  E         1            E-1               1991    a       
  F         1            F-1               1990    d   
  G         1            G-1               2011    d    
  G         2            G-2               2011    d     
  G         3            G-3               2011    c", header=TRUE)

The final result should be a vector of the distinct Ref.Sample.ID values for which df1 and df2 differ on x or y.

E.g. for x: "C-3" "E-1" "G-1" "G-2" "G-3" "D-2"

For y: "C-3" "D-1" "F-1" "G-3" "D-2"

This uses both tidyr and dplyr.

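You can first pivot_longer both data frames, so that x and y each get a separate row to compare. Then use anti_join to find the differences between the 2 data frames. Applied in both directions, this catches extra, missing, and different rows in either data frame.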

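Finally, to get your end result, you can filter by x or y, select Ref.Sample.ID as your column of interest, and call distinct() to remove duplicates. If you want all results in one data frame, an alternative is to use group_by(var) instead of filter.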

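library(tidyverse)

# Reshape so x and y share one value column; cast to character so the integer
# and character columns can be combined (values_transform needs tidyr >= 1.1.0)
df1_long <- pivot_longer(df1, cols = c(x, y), names_to = "var", values_to = "val", values_transform = list(val = as.character))
df2_long <- pivot_longer(df2, cols = c(x, y), names_to = "var", values_to = "val", values_transform = list(val = as.character))

# Rows found in one data frame but not the other, checked in both directions
df_diff <- bind_rows(anti_join(df1_long, df2_long), anti_join(df2_long, df1_long))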

df_diff %>%
  filter(var == "x") %>%
  select(Ref.Sample.ID) %>%
  distinct()

Output

# A tibble: 6 x 1
  Ref.Sample.ID
  <chr>        
1 C-3          
2 E-1          
3 G-1          
4 G-2          
5 G-3          
6 D-2 
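
The group_by(var) alternative mentioned above could look roughly like the sketch below. It is not part of the original answer: the helper `to_long`, the `keys` vector, and the objects `mismatches` and `mismatch_ids` are names made up for this illustration, and the data frames are trimmed copies of the question's data that keep only the columns needed here. It also assumes tidyr 1.1.0 or later for `values_transform`.

```r
library(dplyr)
library(tidyr)

# Trimmed copies of the question's data (Ref.ID and Sample.ID omitted for brevity)
df1 <- data.frame(
  Ref.Sample.ID = c("A-1", "B-1", "C-1", "C-2", "C-3", "D-1", "D-1",
                    "E-1", "F-1", "G-1", "G-2", "G-3"),
  x = c(2000L, 1992L, 2018L, 2018L, 2018L, 2011L, 2011L,
        1990L, 1990L, 2015L, 2015L, 2015L),
  y = c("a", "a", "b", "b", "b", "c", "c", "a", "c", "d", "d", "d")
)
df2 <- data.frame(
  Ref.Sample.ID = c("A-1", "B-1", "C-1", "C-2", "D-1", "D-2",
                    "E-1", "F-1", "G-1", "G-2", "G-3"),
  x = c(2000L, 1992L, 2018L, 2018L, 2011L, 2011L,
        1991L, 1990L, 2011L, 2011L, 2011L),
  y = c("a", "a", "b", "b", "a", "a", "a", "d", "d", "d", "c")
)

# Hypothetical helper: one row per (Ref.Sample.ID, variable), values as character
to_long <- function(df) {
  pivot_longer(df, cols = c(x, y), names_to = "var", values_to = "val",
               values_transform = list(val = as.character))
}
df1_long <- to_long(df1)
df2_long <- to_long(df2)

# Join on all shared columns explicitly to avoid the "Joining with `by`" message
keys <- c("Ref.Sample.ID", "var", "val")
df_diff <- bind_rows(
  anti_join(df1_long, df2_long, by = keys),
  anti_join(df2_long, df1_long, by = keys)
)

# One data frame holding the mismatching IDs for both x and y
mismatches <- df_diff %>%
  group_by(var) %>%
  distinct(Ref.Sample.ID) %>%
  ungroup() %>%
  arrange(var, Ref.Sample.ID)

# Optional: a named list with one character vector of IDs per variable
mismatch_ids <- split(mismatches$Ref.Sample.ID, mismatches$var)
```

Here `mismatch_ids$x` and `mismatch_ids$y` should hold the same IDs as the expected vectors in the question, only sorted alphabetically because of the `arrange()` call.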

Data

df1 <- structure(list(Ref.ID = c("A", "B", "C", "C", "C", "D", "D", 
"E", "F", "G", "G", "G"), Sample.ID = c(1L, 1L, 1L, 2L, 3L, 1L, 
1L, 1L, 1L, 1L, 2L, 3L), Ref.Sample.ID = c("A-1", "B-1", "C-1", 
"C-2", "C-3", "D-1", "D-1", "E-1", "F-1", "G-1", "G-2", "G-3"
), x = c(2000L, 1992L, 2018L, 2018L, 2018L, 2011L, 2011L, 1990L, 
1990L, 2015L, 2015L, 2015L), y = c("a", "a", "b", "b", "b", "c", 
"c", "a", "c", "d", "d", "d")), class = "data.frame", row.names = c(NA, 
-12L))

df2 <- structure(list(Ref.ID = c("A", "B", "C", "C", "D", "D", "E", 
"F", "G", "G", "G"), Sample.ID = c(1L, 1L, 1L, 2L, 1L, 2L, 1L, 
1L, 1L, 2L, 3L), Ref.Sample.ID = c("A-1", "B-1", "C-1", "C-2", 
"D-1", "D-2", "E-1", "F-1", "G-1", "G-2", "G-3"), x = c(2000L, 
1992L, 2018L, 2018L, 2011L, 2011L, 1991L, 1990L, 2011L, 2011L, 
2011L), y = c("a", "a", "b", "b", "a", "a", "a", "d", "d", "d", 
"c")), class = "data.frame", row.names = c(NA, -11L))