How to compare two data frames row by row?
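About a month ago I posted the original question: I needed to compare two data frames row by row and flag the rows in df2 (the second file) that do not match df1 (the first file). The solution was to use an anti-join. That worked well until I added an extra column containing text strings. I need to include that column in the comparison as well and detect which records have mismatched text strings.
Sample data frames are attached. I need to compare df2 against df1 and show which rows of df2 do not match df1. I can use an anti-join in R to show the non-matching rows, but it does not work when the rows contain text strings.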
df1
Product basecode A B C D E F
Tractor A810 382 512 363 553 530 A dog ran fast
Tractor A773 222 155 650 278 215 A dog ran fast
Tractor A203 382 512 363 553 530 A dog ran fast
Tractor A329 332 459 251 341 475 A dog ran fast
Combine B244 244 714 467 122 340 A dog ran fast
Combine B302 257 758 230 704 715 A dog ran fast
Combine B681 670 626 572 795 323 A dog ran fast
Combine B514 768 510 546 542 582 A dog ran fast
Sprayer C850 553 624 557 660 337 A dog ran fast
Sprayer C202 561 733 443 107 526 A dog ran fast
Sprayer C619 256 226 257 770 633 A dog ran fast
Sprayer C292 256 226 257 770 633 A dog ran fast
SPFH D126 323 597 647 159 317 A dog ran fast
SPFH D307 711 535 323 793 769 A dog ran fast
SPFH D355 155 744 772 689 509 A dog ran fast
SPFH D893 155 744 772 689 509 A dog ran fast
df2
Product basecode A B C D E F
Tractor A810 382 512 363 553 530 A dog ran fast
Tractor A773 222 155 650 278 215 A dog ran fast
Tractor A203 382 512 363 553 530 A dog ran fast
Tractor A329 332 459 251 341 475 A dog ran fast
Combine B 244 244 714 467 122 340 A dog ran fast
Combine B302 257 758 230 704 715 A dog ran fast
Combine B681 670 626 572 795 323 A dog ran fast
Combine B514 768 510 546 542 582 A dog ran fast
Sprayer C850 553 624 557 660 337 A dog ran fast
Sprayer C202 561 733 443 107 526 A dog ran fast
Sprayer C619 256 226 257 770 633 A dog ran fast
Sprayer C292 1 1 1 1 1 A dog ran fast
SPFH D126 323 597 647 159 317 A dog ran fast
SPFH D307 711 535 323 793 769 A dog ran fast
SPFH D355 155 744 772 689 509 A dog ran fast
SPFH D893 1 1 1 1 1 A dog ran fast
Tractor A810 491 765 457 249 641 A dog ran fast
Tractor A773 222 155 650 278 215 A dog ran fast
Tractor A203 382 512 363 553 530 A dog ran fast
Tractor A329 332 459 251 341 475 A dog ran fast
Combine B 244 244 714 467 122 340 A dog ran fast
Combine B302 257 758 230 704 715 A cat ran slow
Combine B681 670 626 572 795 323 cat
Combine B514 768 510 546 542 582 A dog ran fast
Sprayer C850 553 624 557 660 337 A dog ran fast
Sprayer C202 561 733 443 107 526 A dog ran fast
Sprayer C619 256 226 257 770 633 A dog ran fast
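Code
library(dplyr)
# add an id column to identify which rows do not match
df2 <- df2 %>% mutate(id = basecode)
# anti_join() joins on every column the two data frames share; id exists only in df2
df_unmatch <- anti_join(df2, df1)
# the ids in df_unmatch are the non-matching rows
df_unmatch$id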
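To make it visible that the text column takes part in the comparison, the join columns can be named explicitly with `by`. The following is a minimal sketch on made-up toy data; the names `ref` and `new` and their values are assumptions for illustration, not the data from the question.

```r
library(dplyr)

# Toy stand-ins for df1 (ref) and df2 (new); the values are hypothetical
ref <- tibble(basecode = c("A810", "B302", "B681"),
              A = c(382, 257, 670),
              F = c("A dog ran fast", "A dog ran fast", "A dog ran fast"))
new <- tibble(basecode = c("A810", "B302", "B681"),
              A = c(382, 257, 670),
              F = c("A dog ran fast", "A cat ran slow", "cat"))

# Name the join columns explicitly so the text column F is compared as well
unmatched <- anti_join(new, ref, by = c("basecode", "A", "F"))

# Rows B302 and B681 come back because only their text differs
unmatched$basecode
```

A character column is compared for exact equality, so differences in spacing or case (for example "B 244" versus "B244") also count as a mismatch.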
Data
df2 <- structure(list(Product = c("Tractor", "Tractor", "Tractor", "Tractor",
"Combine", "Combine", "Combine", "Combine", "Sprayer", "Sprayer",
"Sprayer", "Sprayer", "SPFH", "SPFH", "SPFH", "SPFH", "Tractor",
"Tractor", "Tractor", "Tractor", "Combine", "Combine", "Combine",
"Combine", "Sprayer", "Sprayer", "Sprayer"), basecode = c("A810",
"A773", "A203", "A329", "B 244", "B302", "B681", "B514", "C850",
"C202", "C619", "C292", "D126", "D307", "D355", "D893", "A810",
"A773", "A203", "A329", "B 244", "B302", "B681", "B514", "C850",
"C202", "C619"), A = c(382, 222, 382, 332, 244, 257, 670, 768,
553, 561, 256, 1, 323, 711, 155, 1, 491, 222, 382, 332, 244,
257, 670, 768, 553, 561, 256), B = c(512, 155, 512, 459, 714,
758, 626, 510, 624, 733, 226, 1, 597, 535, 744, 1, 765, 155,
512, 459, 714, 758, 626, 510, 624, 733, 226), C = c(363, 650,
363, 251, 467, 230, 572, 546, 557, 443, 257, 1, 647, 323, 772,
1, 457, 650, 363, 251, 467, 230, 572, 546, 557, 443, 257), D = c(553,
278, 553, 341, 122, 704, 795, 542, 660, 107, 770, 1, 159, 793,
689, 1, 249, 278, 553, 341, 122, 704, 795, 542, 660, 107, 770
), E = c(530, 215, 530, 475, 340, 715, 323, 582, 337, 526, 633,
1, 317, 769, 509, 1, 641, 215, 530, 475, 340, 715, 323, 582,
337, 526, 633), F = c("A dog ran fast", "A dog ran fast", "A dog ran fast",
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast",
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast",
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast",
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast",
"A dog ran fast", "A dog ran fast", "A dog ran fast", "cat",
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast"
)), row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"
))
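It does work, unless you have special expectations (see Limey's comment). The two files you provided are actually identical (see MonJeanJean's comment), so let's start by creating some non-matching rows: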
df1$F <- "A dog ran faster" ## df2 has "cat" somewhere
df2$A[16] <- 155
anti_join(df2, df1)
# A tibble: 2 x 8
Product basecode A B C D E F
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 SPFH D893 1 1 1 1 1 A dog ran fast
2 Combine B681 670 626 572 795 323 cat
What result do you expect?
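If the expected result is the list of records whose key matches but whose text differs, one possible approach is to pair the records by key and compare the text columns side by side. This is only a sketch: it assumes `basecode` identifies a record uniquely, which is not true for the df2 in the question (its base codes repeat), and the toy tables `ref` and `new` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; basecode is assumed to be a unique key here
ref <- tibble(basecode = c("B302", "B681", "B514"),
              F = c("A dog ran fast", "A dog ran fast", "A dog ran fast"))
new <- tibble(basecode = c("B302", "B681", "B514"),
              F = c("A cat ran slow", "cat", "A dog ran fast"))

# Pair records by key, then keep only the pairs whose text differs
text_diff <- inner_join(new, ref, by = "basecode",
                        suffix = c("_df2", "_df1")) %>%
  filter(F_df2 != F_df1)

text_diff
```

With repeated keys the join would pair every duplicate with every other, so a row identifier would be needed in addition to `basecode`.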
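Not the most readable solution, but it may be useful if you have many rows.
You can get both the matching and the non-matching rows:
library(data.table)
# tag each row with its source and original row number, then full-join on all columns;
# rows found in both data frames have a non-missing source on each side
match <- merge(as.data.table(df1)[, c(.SD, .(source = "df1", id1 = 1:nrow(df1)))],
as.data.table(df2)[, c(.SD, .(source = "df2", id2 = 1:nrow(df2)))],
by = c("Product", "basecode", "A", "B", "C", "D", "E", "F"),
all = TRUE)[!is.na(source.x) & !is.na(source.y)]
# rows present in only one of the two data frames
unmatch <- merge(as.data.table(df1)[, c(.SD, .(source = "df1", id1 = 1:nrow(df1)))],
as.data.table(df2)[, c(.SD, .(source = "df2", id2 = 1:nrow(df2)))],
by = c("Product", "basecode", "A", "B", "C", "D", "E", "F"),
all = TRUE)[is.na(source.x) | is.na(source.y)]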
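The same idea can be written with a single merge that is then split into matched and unmatched rows. The sketch below uses small made-up tables; `dt1`, `dt2`, `both`, `matched` and `unmatched` are illustrative names, not objects from the question.

```r
library(data.table)

# Hypothetical toy tables standing in for df1 and df2
dt1 <- data.table(basecode = c("A810", "B302"),
                  F = c("A dog ran fast", "A dog ran fast"))
dt2 <- data.table(basecode = c("A810", "B302", "B681"),
                  F = c("A dog ran fast", "A cat ran slow", "cat"))

# Record each row's original position so it can be traced back after the merge
dt1[, id1 := .I]
dt2[, id2 := .I]

# Full outer join on every compared column; a missing id marks a one-sided row
both <- merge(dt1, dt2, by = c("basecode", "F"), all = TRUE)
matched   <- both[!is.na(id1) & !is.na(id2)]
unmatched <- both[is.na(id1) | is.na(id2)]
```

In `unmatched`, rows with a missing `id1` exist only in the second table and rows with a missing `id2` exist only in the first, so `id2` points back to the offending rows of df2.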