R: How to efficiently find out whether data.frame A is contained in data.frame B?

Question

To find out whether the data frame df.a is a subset of the data frame df.b, I did the following:

df.a <- data.frame( x=1:5, y=6:10 )
df.b <- data.frame( x=1:7, y=6:12 )
inds.x <- as.integer( lapply( df.a$x, function(x) which(df.b$x == x) ))
inds.y <- as.integer( lapply( df.a$y, function(y) which(df.b$y == y) ))
identical( inds.x, inds.y )

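The last line gives TRUE, so df.a is contained in df.b.

Now I wonder whether there is a more elegant, and possibly more efficient, way to answer this question?

The task also extends easily to finding the intersection of two given data frames, possibly based on only a subset of the columns.

Thanks a lot for your help.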

Answer 1

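I'll venture a guess at an answer.

I think semi_join from dplyr does what you want, even taking duplicated rows into account.

First, note what the help file ?semi_join says: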

return all rows from x where there are matching values in y, keeping just columns from x.

A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
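
To make the quoted distinction concrete, here is a minimal sketch. The data frames x and y and the key column k are made up for illustration and are not part of the question.

```r
library(dplyr)

# Hypothetical toy data: the key k = 1 appears twice in y.
x <- data.frame(k = c(1, 2, 3), v = c("a", "b", "c"))
y <- data.frame(k = c(1, 1, 2))

# Inner join: one output row per matching pair, so the k = 1 row of x
# shows up twice.
inner <- inner_join(x, y, by = "k")

# Semi join: each row of x is kept at most once, and only the columns
# of x are returned.
semi <- semi_join(x, y, by = "k")

nrow(inner)  # 3
nrow(semi)   # 2
```

The row with k = 3 has no match in y, so it is dropped by both joins.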

OK, this suggests that the following should, correctly, fail:

library(dplyr)
# df.a contains the row (1, 6) twice; df.b contains it only once
df.a <- data.frame( x=c(1:5,1), y=c(6:10,6) )
df.b <- data.frame( x=1:7, y=6:12 )
identical(semi_join(df.b, df.a),  semi_join(df.a, df.a))

This gives FALSE, as expected, because semi_join(df.b, df.a) returns only five rows, while df.a has six:

> semi_join(df.b, df.a)
Joining by: c("x", "y")
  x  y
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10

However, the following should pass:

df.c <- data.frame( x=c(1:7, 1), y= c(6:12, 6) )
identical(semi_join(df.c, df.a), semi_join(df.a, df.a))

And indeed it does, giving TRUE.

The second semi_join(df.a, df.a) is needed to get df.a into the canonical sort order.
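
If duplicates do not matter and you only care whether every row of df.a occurs somewhere in df.b, a different technique may be simpler: an anti join, which returns the rows of one table that have no match in the other. The sketch below is only an illustration under that assumption. The helper name is_contained and its cols argument are hypothetical, and unlike the semi_join comparison above it ignores how often a row is repeated.

```r
library(dplyr)

# Hypothetical helper: TRUE if every row of `a` has a match in `b` on `cols`.
# Multiplicity is ignored: a duplicated row in `a` still counts as contained.
is_contained <- function(a, b, cols = intersect(names(a), names(b))) {
  # anti_join() keeps the rows of `a` without a match in `b`;
  # containment holds when no such rows are left.
  nrow(anti_join(a, b, by = cols)) == 0
}

df.a <- data.frame(x = 1:5, y = 6:10)
df.b <- data.frame(x = 1:7, y = 6:12)

is_contained(df.a, df.b)              # TRUE
is_contained(df.b, df.a)              # FALSE
is_contained(df.a, df.b, cols = "x")  # TRUE, compares column x only

# Intersection restricted to column x: the rows of df.b whose x also
# occurs in df.a.
common <- semi_join(df.b, df.a, by = "x")
```

Passing a subset of column names as cols covers the "only a subset of the columns" variant from the question.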
