Check if data.frame is a subset of another data.frame
Suppose I have the following lookup table:
(lkp <- structure(list(a = c("a", "a", "a", "b", "c"),
b = c("a1 a2", "a3 a2", "a3", "a1", "a1")),
row.names = c("lkp_1", "lkp_2", "lkp_3", "lkp_4", "lkp_5"),
class = "data.frame"))
# a b
# lkp_1 a a1 a2
# lkp_2 a a3 a2
# lkp_3 a a3
# lkp_4 b a1
# lkp_5 c a1
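I want to check whether another data.frame, x, is a subset of lkp, with the important additional requirement that for column b a match means that lkp$b only needs to contain x$b.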
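To make the "contains" requirement concrete, here is a minimal sketch for a single row of x. The names x_a, x_b, hits and matched are illustrative only, and fixed = TRUE is an assumption that x$b should be read as a literal string rather than as a regular expression.

```r
lkp <- data.frame(
  a = c("a", "a", "a", "b", "c"),
  b = c("a1 a2", "a3 a2", "a3", "a1", "a1"),
  row.names = paste0("lkp_", 1:5),
  stringsAsFactors = FALSE
)

# One hypothetical row of x: a == "a", b == "a2"
x_a <- "a"
x_b <- "a2"

# Column a must be equal; lkp$b only has to contain x_b.
# fixed = TRUE makes grepl() search for the literal string.
hits <- lkp$a == x_a & grepl(x_b, lkp$b, fixed = TRUE)

matched <- rownames(lkp)[hits]
matched
# [1] "lkp_1" "lkp_2"
```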
The following examples should illustrate what I mean:
(chk <- list(c1 = structure(list(a = c("a", "a"), b = c("a2", "a2")), row.names = c(NA, -2L), class = "data.frame"),
c2 = structure(list(a = "b", b = "a1"), row.names = c(NA, -1L), class = "data.frame"),
c3 = structure(list(a = c("a", "a"), b = c("a1", "a1")), row.names = c(NA, -2L), class = "data.frame"),
c4 = structure(list(a = c("a", "a"), b = c("a3", "a2")), row.names = c(NA, -2L), class = "data.frame")))
# $c1
# a b
# 1 a a2
# 2 a a2
# $c2
# a b
# 1 b a1
# $c3
# a b
# 1 a a1
# 2 a a1
# $c4
# a b
# 1 a a3
# 2 a a2
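- chk$c1: row 1 matches row lkp_1 (and lkp_2), because column a is identical and lkp$b contains a2.
- chk$c2 and chk$c4 match as well.
- chk$c3 does not match. Although each of its rows matches lkp_1, c3 is not a subset, because lkp would need to contain 2 distinct matching rows.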
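The c3 case can be inspected by listing, for each row of c3, which rows of lkp it is compatible with. This is only a sketch of the reasoning above; the variable matches is illustrative and fixed = TRUE is again an assumption.

```r
lkp <- data.frame(
  a = c("a", "a", "a", "b", "c"),
  b = c("a1 a2", "a3 a2", "a3", "a1", "a1"),
  row.names = paste0("lkp_", 1:5),
  stringsAsFactors = FALSE
)
c3 <- data.frame(a = c("a", "a"), b = c("a1", "a1"),
                 stringsAsFactors = FALSE)

# For every row of c3, collect the names of the compatible lkp rows
matches <- lapply(seq_len(nrow(c3)), function(i) {
  rownames(lkp)[lkp$a == c3$a[i] & grepl(c3$b[i], lkp$b, fixed = TRUE)]
})

# Both rows of c3 can only be served by lkp_1 ...
unique(unlist(matches))
# [1] "lkp_1"

# ... so there are fewer distinct candidates than rows to cover
length(unique(unlist(matches))) < nrow(c3)
# [1] TRUE
```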
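In principle I am looking for a merge (or join) where the join condition uses some kind of fuzzy matching.
I found and read these two SO answers:
- R merge data frames, allow inexact ID matching (e.g. with additional characters 1234 matches ab1234)
The second answer in particular looks promising. However, I do not need approximate matching, but rather some kind of does_contain relation instead of pure equality. So maybe a regex solution would work?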
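One thing to watch with a plain substring or regex test is that "a1" is also contained in "a10". If lkp$b is meant to be a space-separated list of entries (an assumption, the question does not say so), a whole-token comparison may be closer to the intended does_contain relation. contains_token below is a hypothetical helper, not an existing function.

```r
# Hypothetical helper: TRUE when `token` is one of the space-separated
# entries in `cell` (the space separator is an assumption about lkp$b)
contains_token <- function(token, cell) {
  token %in% strsplit(cell, " ", fixed = TRUE)[[1]]
}

substring_hit <- grepl("a1", "a10 a2", fixed = TRUE)  # TRUE: "a1" is inside "a10"
token_hit     <- contains_token("a1", "a10 a2")       # FALSE: no entry equals "a1"
token_hit_a2  <- contains_token("a2", "a10 a2")       # TRUE
```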
Expected result
magic_is_subset_function <- function(chk, lkp) {
# ...
}
sapply(chk, magic_is_subset_function, lkp = lkp)
# [1] TRUE TRUE FALSE TRUE
sapply(
chk,
function(v) {
sum(
rowSums(sapply(v$a, `==`, lkp$a) &
sapply(v$b, grepl, x = lkp$b)) > 0
) >= nrow(v)
}
)
or
sapply(
chk,
function(v) {
sum(
colSums(
do.call(
`&`,
Map(
function(x, y) outer(x, y, FUN = Vectorize(function(a, b) grepl(a, b))),
v,
lkp
)
)
) > 0
) >= nrow(v)
}
)
which gives
   c1    c2    c3    c4
 TRUE  TRUE FALSE  TRUE
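Both snippets above count how many rows of lkp are matched by at least one row of v and compare that count with nrow(v). That works for the four examples, but it would also return TRUE for a v in which one row matches two lkp rows and another row matches none. If every row must get its own distinct lkp row, a stricter check could look like the sketch below. is_subset_strict is a hypothetical name, fixed = TRUE is an assumption, and the pairing uses a simple augmenting-path search, which is a swapped-in technique and not part of the answer above.

```r
lkp <- data.frame(
  a = c("a", "a", "a", "b", "c"),
  b = c("a1 a2", "a3 a2", "a3", "a1", "a1"),
  row.names = paste0("lkp_", 1:5),
  stringsAsFactors = FALSE
)
chk <- list(
  c1 = data.frame(a = c("a", "a"), b = c("a2", "a2"), stringsAsFactors = FALSE),
  c2 = data.frame(a = "b", b = "a1", stringsAsFactors = FALSE),
  c3 = data.frame(a = c("a", "a"), b = c("a1", "a1"), stringsAsFactors = FALSE),
  c4 = data.frame(a = c("a", "a"), b = c("a3", "a2"), stringsAsFactors = FALSE)
)

# Hypothetical helper: TRUE if every row of v can be paired with its own,
# distinct row of lkp (column a equal, lkp$b containing v$b)
is_subset_strict <- function(v, lkp) {
  n <- nrow(v)
  k <- nrow(lkp)

  # m[i, j] is TRUE when row i of v is compatible with row j of lkp
  m <- matrix(FALSE, n, k)
  for (i in seq_len(n)) {
    m[i, ] <- lkp$a == v$a[i] & grepl(v$b[i], lkp$b, fixed = TRUE)
  }

  owner <- rep(NA_integer_, k)  # v row currently assigned to each lkp row
  seen <- logical(k)            # lkp rows already tried for the current v row

  # Try to place v row i, moving earlier rows to other lkp rows if needed
  place <- function(i) {
    for (j in which(m[i, ])) {
      if (seen[j]) next
      seen[j] <<- TRUE
      if (is.na(owner[j]) || place(owner[j])) {
        owner[j] <<- i
        return(TRUE)
      }
    }
    FALSE
  }

  for (i in seq_len(n)) {
    seen <- logical(k)  # reset before placing each new row
    if (!place(i)) return(FALSE)
  }
  TRUE
}

res <- sapply(chk, is_subset_strict, lkp = lkp)
res
#    c1    c2    c3    c4
#  TRUE  TRUE FALSE  TRUE

# One row matches lkp_1 and lkp_2, the other matches nothing
bad <- data.frame(a = c("a", "z"), b = c("a2", "zz"), stringsAsFactors = FALSE)
is_subset_strict(bad, lkp)
# [1] FALSE
```

For small inputs like these the extra cost is negligible; whether the stricter semantics are wanted depends on how "subset" is meant in the question.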