如何逐行比较两个数据帧?

How to compare two data frames line by line?

大约一个月前,我发布了原始问题,我需要逐行比较两个数据帧并标记 df2(第二个文件)中与 df1(第一个文件)不匹配的行。解决方案是使用反连接。在我添加一个带有文本字符串的附加列之前,这非常有效。我还需要将该列包含在比较中并检测文本字符串的哪些记录不匹配。

附件是示例数据框。我需要将 df2 与 df1 进行比较,并显示 df2 中的哪些行与 df1 不匹配。我可以使用 R 中的反连接来显示哪些行不匹配,但是当我在行中有文本字符串时它不起作用。

df1

Product basecode    A   B   C   D   E   F
Tractor A810     382    512 363 553 530 A dog ran fast
Tractor A773     222    155 650 278 215 A dog ran fast
Tractor A203     382    512 363 553 530 A dog ran fast
Tractor A329     332    459 251 341 475 A dog ran fast
Combine B244     244    714 467 122 340 A dog ran fast
Combine B302     257    758 230 704 715 A dog ran fast
Combine B681     670    626 572 795 323 A dog ran fast
Combine B514     768    510 546 542 582 A dog ran fast
Sprayer C850     553    624 557 660 337 A dog ran fast
Sprayer C202     561    733 443 107 526 A dog ran fast
Sprayer C619     256    226 257 770 633 A dog ran fast
Sprayer C292     256    226 257 770 633 A dog ran fast
SPFH    D126     323    597 647 159 317 A dog ran fast
SPFH    D307     711    535 323 793 769 A dog ran fast
SPFH    D355     155    744 772 689 509 A dog ran fast
SPFH    D893     155    744 772 689 509 A dog ran fast

df2

Product basecode    A   B   C   D   E   F
Tractor A810     382    512 363 553 530 A dog ran fast
Tractor A773     222    155 650 278 215 A dog ran fast
Tractor A203     382    512 363 553 530 A dog ran fast
Tractor A329     332    459 251 341 475 A dog ran fast
Combine B 244    244    714 467 122 340 A dog ran fast
Combine B302     257    758 230 704 715 A dog ran fast
Combine B681     670    626 572 795 323 A dog ran fast
Combine B514     768    510 546 542 582 A dog ran fast
Sprayer C850     553    624 557 660 337 A dog ran fast
Sprayer C202     561    733 443 107 526 A dog ran fast
Sprayer C619     256    226 257 770 633 A dog ran fast
Sprayer C292     1  1   1   1   1   A dog ran fast
SPFH    D126     323    597 647 159 317 A dog ran fast
SPFH    D307     711    535 323 793 769 A dog ran fast
SPFH    D355     155    744 772 689 509 A dog ran fast
SPFH    D893     1  1   1   1   1   A dog ran fast
Tractor A810     491    765 457 249 641 A dog ran fast
Tractor A773     222    155 650 278 215 A dog ran fast
Tractor A203     382    512 363 553 530 A dog ran fast
Tractor A329     332    459 251 341 475 A dog ran fast
Combine B 244    244    714 467 122 340 A dog ran fast
Combine B302     257    758 230 704 715 A cat ran slow
Combine B681     670    626 572 795 323 cat
Combine B514     768    510 546 542 582 A dog ran fast
Sprayer C850     553    624 557 660 337 A dog ran fast
Sprayer C202     561    733 443 107 526 A dog ran fast
Sprayer C619     256    226 257 770 633 A dog ran fast

代码

# add id to identify which rows are not matching
df2 <- df2 %>% mutate(id = basecode)

df_unmatch <- anti_join(df2, df1)

# list of non-match are the ids of df_unmatch
df_unmatch$id

数据

#structure(list(Product = c("Tractor", "Tractor", "Tractor", "Tractor", 
"Combine", "Combine", "Combine", "Combine", "Sprayer", "Sprayer", 
"Sprayer", "Sprayer", "SPFH", "SPFH", "SPFH", "SPFH", "Tractor", 
"Tractor", "Tractor", "Tractor", "Combine", "Combine", "Combine", 
"Combine", "Sprayer", "Sprayer", "Sprayer"), basecode = c("A810", 
"A773", "A203", "A329", "B 244", "B302", "B681", "B514", "C850", 
"C202", "C619", "C292", "D126", "D307", "D355", "D893", "A810", 
"A773", "A203", "A329", "B 244", "B302", "B681", "B514", "C850", 
"C202", "C619"), A = c(382, 222, 382, 332, 244, 257, 670, 768, 
553, 561, 256, 1, 323, 711, 155, 1, 491, 222, 382, 332, 244, 
257, 670, 768, 553, 561, 256), B = c(512, 155, 512, 459, 714, 
758, 626, 510, 624, 733, 226, 1, 597, 535, 744, 1, 765, 155, 
512, 459, 714, 758, 626, 510, 624, 733, 226), C = c(363, 650, 
363, 251, 467, 230, 572, 546, 557, 443, 257, 1, 647, 323, 772, 
1, 457, 650, 363, 251, 467, 230, 572, 546, 557, 443, 257), D = c(553, 
278, 553, 341, 122, 704, 795, 542, 660, 107, 770, 1, 159, 793, 
689, 1, 249, 278, 553, 341, 122, 704, 795, 542, 660, 107, 770
), E = c(530, 215, 530, 475, 340, 715, 323, 582, 337, 526, 633, 
1, 317, 769, 509, 1, 641, 215, 530, 475, 340, 715, 323, 582, 
337, 526, 633), F = c("A dog ran fast", "A dog ran fast", "A dog ran fast", 
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast", 
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast", 
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast", 
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast", 
"A dog ran fast", "A dog ran fast", "A dog ran fast", "cat", 
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast"
)), row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"
))

它确实有效,除非您有特殊期望(请参阅 Limey 的评论)。您提供的两个文件实际上是相同的(请参阅 MonJeanJean 的评论),所以让我们从创建不匹配的行开始:

df1$F <- "A dog ran faster" ## df2 has "cat" somewhere
df2$A[16] <- 155
anti_join(df2, df1)

# A tibble: 2 x 8
  Product basecode     A     B     C     D     E F             
  <chr>   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <chr>         
1 SPFH    D893         1     1     1     1     1 A dog ran fast
2 Combine B681       670   626   572   795   323 cat           

你期待什么结果?

不是最易读的解决方案,但如果你有很多行,可能会有用.. 你可以获得匹配和不匹配的行..

library(data.table)

match <- merge(as.data.table(df1)[, c(.SD, .(source = "df1", id1 = 1:nrow(df1)))], 
      as.data.table(df2)[, c(.SD, .(source = "df2", id2 = 1:nrow(df1)))], 
      by = c("Product",  "basecode", "A",        "B",        "C",        "D",        "E",        "F" ), 
      all = TRUE)[!is.na(source.x) & !is.na(source.y)]

unmatch <- merge(as.data.table(df1)[, c(.SD, .(source = "df1", id1 = 1:nrow(df1)))], 
                 as.data.table(df2)[, c(.SD, .(source = "df2", id2 = 1:nrow(df1)))], 
               by = c("Product",  "basecode", "A",        "B",        "C",        "D",        "E",        "F" ), 
               all = TRUE)[is.na(source.x) | is.na(source.y)]