Check if values of one dataframe exist in another dataframe in exact order

Question

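I have one dataframe and multiple "reference" dataframes. I am trying to automatically check whether the values of the dataframe match the values of the reference dataframes. Importantly, the values must also appear in the same order as in the reference dataframes. The columns shown here are the important ones, but my real dataset contains many more columns.

Below is a toy dataset.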

Dataframe

group   type    value
1       A       Teddy
1       A       William
1       A       Lars
2       B       Dolores
2       B       Elsie
2       C       Maeve
2       C       Charlotte
2       C       Bernard


Reference_A

type    value
A       Teddy
A       William
A       Lars

Reference_B

type    value
B       Elsie
B       Dolores

Reference_C

type    value
C       Maeve
C       Hale
C       Bernard

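For example, in the toy dataset, group 1 scores 1.0 (100% correct) because all of its values in A match both the values and the order of values in Reference_A. Group 2, however, scores 0.0 for B because its values are out of order compared to Reference_B, and 0.66 for C because 2 of its 3 values match the values and the order of values in Reference_C.

Desired output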
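To make the scoring rule concrete, the comparison for type C can be sketched as below. This is only an illustration of the arithmetic; the vector names observed and reference are made up for the example.

```r
# Values for type C, in row order, from the dataframe and from Reference_C
observed  <- c("Maeve", "Charlotte", "Bernard")
reference <- c("Maeve", "Hale", "Bernard")

# Compare position by position: 2 of the 3 positions agree
matches <- observed == reference

# The score is the share of matching positions
score <- mean(matches)
round(score, 2)
```

Because the comparison is positional, swapping two otherwise correct values (as in type B) turns both positions into mismatches.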

group   type    score
1       A       1.0
2       B       0.0
2       C       0.66

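This was helpful but does not take order into account: Check whether values in one data frame column exist in a second data frame

Update: Thanks to everyone who provided solutions! They work great for the toy dataset but do not yet adapt to a dataset with more columns. Again, as I wrote in my post, the columns listed above are the important ones; if possible, I would prefer not to have to drop the other columns.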

Answer 1

Here is a "tidy" approach:

library(dplyr)
# library(purrr) # map2_dbl
Reference <- bind_rows(Reference_A, Reference_B, Reference_C) %>%
  nest_by(type, .key = "ref") %>%
  ungroup()
Reference
# # A tibble: 3 x 2
#   type                 ref
#   <chr> <list<tbl_df[,1]>>
# 1 A                [3 x 1]
# 2 B                [2 x 1]
# 3 C                [3 x 1]

Dataframe %>%
  nest_by(group, type, .key = "data") %>%
  left_join(Reference, by = "type") %>%
  mutate(
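    # Compare the dataframe values with the reference values, position by position
    score = purrr::map2_dbl(data, ref, ~ {
      # map2_dbl() requires a length-one result, so empty input yields NA
      if (length(.x) == 0 || length(.y) == 0) return(NA_real_)
      # Different lengths cannot match in exact order
      if (length(.x) != length(.y)) return(0)
      sum((is.na(.x) & is.na(.y)) | .x == .y) / length(.x)
    })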
  ) %>%
  select(-data, -ref) %>%
  ungroup()
# # A tibble: 3 x 3
#   group type  score
#   <int> <chr> <dbl>
# 1     1 A     1    
# 2     2 B     0    
# 3     2 C     0.667
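Regarding the update about extra columns: one possible variation is to refer to the value column by name, so that any additional columns are simply ignored. The following is a minimal sketch under that assumption, not the code above. The extra column, the helper score_in_order, and the lookup list ref_by_type are hypothetical names introduced for illustration, and Reference stands for the three reference tables stacked together.

```r
library(dplyr)

# Toy data from the question, plus a hypothetical `extra` column standing in
# for the additional columns of the real dataset
Dataframe <- tibble(
  group = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  type  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value = c("Teddy", "William", "Lars", "Dolores", "Elsie",
            "Maeve", "Charlotte", "Bernard"),
  extra = seq_len(8)
)
Reference <- tibble(
  type  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value = c("Teddy", "William", "Lars", "Elsie", "Dolores",
            "Maeve", "Hale", "Bernard")
)

# Hypothetical helper: share of positions where x and y agree (NA matches NA);
# a length mismatch scores 0
score_in_order <- function(x, y) {
  if (length(x) != length(y)) return(0)
  mean((is.na(x) & is.na(y)) | (x == y) %in% TRUE)
}

# Reference values per type, kept in their original row order
ref_by_type <- split(Reference$value, Reference$type)

result <- Dataframe %>%
  group_by(group, type) %>%
  summarise(score = score_in_order(value, ref_by_type[[first(type)]]),
            .groups = "drop")
result
```

Only group, type and value are touched, so the remaining columns do not need to be dropped from the original dataframe beforehand.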

Answer 2

Here is another tidyverse solution. I add a counter (i.e. rowname) to both the references and the data, then join them on type and rowname. Finally, I summarise by type to get the desired output.

library(dplyr)
library(purrr)
library(tibble)

list(Reference_A, Reference_B, Reference_C) %>% 
  map(., rownames_to_column) %>% 
  bind_rows %>% 
 left_join({Dataframe %>%
             group_split(type) %>% 
             map(., rownames_to_column) %>% 
             bind_rows}, 
             . , by=c("type", "rowname")) %>% 
  group_by(type) %>% 
  dplyr::summarise(group = head(group,1),
            score = sum(value.x == value.y)/n())

#> # A tibble: 3 x 3
#>   type  group score
#>   <chr> <int> <dbl>
#> 1 A         1 1    
#> 2 B         2 0    
#> 3 C         2 0.667
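The counter-and-join idea described above can also be sketched with row_number() in place of row names. This is a minimal sketch, assuming the three references have been stacked into one table called Reference; the column names pos and value_ref are made up for the example.

```r
library(dplyr)

Dataframe <- tibble(
  group = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  type  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value = c("Teddy", "William", "Lars", "Dolores", "Elsie",
            "Maeve", "Charlotte", "Bernard")
)
Reference <- tibble(
  type  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value = c("Teddy", "William", "Lars", "Elsie", "Dolores",
            "Maeve", "Hale", "Bernard")
)

# Number the reference rows within each type
ref_indexed <- Reference %>%
  group_by(type) %>%
  mutate(pos = row_number()) %>%
  ungroup() %>%
  rename(value_ref = value)

# Number the data rows within each group/type pair
data_indexed <- Dataframe %>%
  group_by(group, type) %>%
  mutate(pos = row_number()) %>%
  ungroup()

# Join on type and position, then average the matches per group and type
result <- data_indexed %>%
  left_join(ref_indexed, by = c("type", "pos")) %>%
  group_by(group, type) %>%
  summarise(score = mean((value == value_ref) %in% TRUE), .groups = "drop")
result
```

In this sketch a position with no counterpart in the reference counts as a mismatch, while reference rows beyond the length of the group are not penalised; whether that is the desired behaviour depends on the real data.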

Answer 3

We could also use mget to return a list of data.frames, bind them together, and take a grouped mean of the logical vector.

library(dplyr)
mget(ls(pattern = '^Reference_[A-Z]$')) %>%
    bind_rows() %>% 
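    # Dataframe is the data frame from the question; its rows must line up with the stacked references
    bind_cols(Dataframe) %>% 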
    group_by(group, type = type...1) %>% 
    summarise(score = mean(value...2 == value...5))
# Groups:   group [2]
#  group type  score
#  <int> <chr> <dbl>
#1     1 A     1    
#2     2 B     0    
#3     2 C     0.667

r

dataframe

data-wrangling