Remove rows in an R data frame using 2 linked conditions (a combination of values) from another data frame

Question

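I have two different datasets in R (MRE below).

One (ModuleViews) contains one log entry per module visit; the other (PageViews) records each specific page visit within a module visit.

The column moduleid contains the same module codes, and session_id the same session codes, in both datasets.

I processed the PageViews dataset, and now I want to update the ModuleViews dataset accordingly.

To do this, I need R to check/match moduleid and session_id together, row by row, because within one session (e.g. 25) a user can visit multiple modules (for session 25: modules 1697, 1698 and 1755).

In this case, my processing removed all page views of session 25, module 1697.

I now want to remove this row (and all others like it) from the ModuleViews dataset: every row whose combination of moduleid and session_id does not occur in the PageViews dataset.

I tried the following 3 ways: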

ModuleViews <- subset(ModuleViews, ModuleViews$session_id %in% PageViews$session_id & 
                         ModuleViews$moduleid %in% PageViews$moduleid)

ModuleViews <- ModuleViews[(ModuleViews$session_id %in% PageViews$session_id) && 
                         (ModuleViews$moduleid %in% PageViews$moduleid),]

ModuleViews$moduleid <- ifelse((ModuleViews$session_id %in% PageViews$session_id) & 
                         (ModuleViews$moduleid %in% PageViews$moduleid), ModuleViews$moduleid, NA)

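But none of them looks at both columns together; each column is checked on its own, which leaves session 25, module 1697 in the output.

I tried both %in% and ==, but with == I get a length error (apparently because the datasets have different lengths):

Error: Must subset rows with a valid subscript vector. ℹ Logical subscripts must match the size of the indexed input. x Input has size 220099 but subscript r has size 2024529.

How can I make it check both conditions for each row?

TIA!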

ModuleViews:

structure(list(session_id = c(19L, 19L, 24L, 25L, 25L, 25L, 28L
), moduleid = c(397L, 902L, 690L, 1697L, 1698L, 1755L, 1271L), 
    numslidesread = c(1L, 1L, 31L, 2L, 31L, 44L, 3L), totalsecondsspent = c(5L, 
    13L, 5829L, 10955L, 6942L, 9725L, 667L)), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

PageViews:

structure(list(session_id = c(19L, 19L, 24L, 24L, 24L, 24L, 24L, 
24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 
24L, 24L, 24L, 24L, 24L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 
25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 
25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 
25L), slideitem_id = c(19974L, 53092L, 37143L, 37004L, 37061L, 
37055L, 37061L, 37062L, 37073L, 37079L, 37079L, 37080L, 37097L, 
37124L, 37131L, 37136L, 37138L, 37143L, 37143L, 37144L, 37145L, 
37170L, 65628L, 37191L, 37192L, 85817L, 85818L, 85819L, 85820L, 
85821L, 85821L, 85822L, 85823L, 85824L, 85825L, 85826L, 85827L, 
85828L, 85829L, 85828L, 85829L, 85830L, 85831L, 85832L, 85833L, 
85834L, 85835L, 85836L, 85837L, 85838L, 85839L, 85840L, 85841L, 
85842L, 85624L, 85234L, 85235L, 85607L, 85614L, 85619L), moduleid = c(397L, 
902L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 
690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 
690L, 690L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 
1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 
1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 
1698L, 1698L, 1698L, 1698L, 1755L, 1755L, 1755L, 1755L, 1755L, 
1755L), secondsspentonslide = c(5L, 13L, 154L, 9L, 5L, 9L, 248L, 
17L, 385L, 209L, 364L, 61L, 81L, 175L, 45L, 352L, 23L, 216L, 
35L, 227L, 80L, 375L, 7L, 3L, 3L, 21L, 8L, 43L, 211L, 61L, 37L, 
58L, 50L, 96L, 67L, 36L, 21L, 11L, 3L, 7L, 96L, 66L, 9L, 79L, 
180L, 144L, 127L, 168L, 22L, 49L, 22L, 51L, 127L, 33L, 19L, 5L, 
25L, 73L, 7L, 15L)), row.names = c(NA, -60L), class = c("tbl_df", 
"tbl", "data.frame"))

Answer 1

If I understand correctly, you want an inner_join. With dplyr:

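library(dplyr)
# Join on both key columns at once; distinct() keeps one row per
# (session_id, moduleid) pair so ModuleViews rows are not duplicated
result = ModuleViews %>%
  inner_join(distinct(PageViews, session_id, moduleid),
             by = c("session_id", "moduleid"))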

result
# # A tibble: 5 × 4
#   session_id moduleid numslidesread totalsecondsspent
#        <int>    <int>         <int>             <int>
# 1         19      397             1                 5
# 2         19      902             1                13
# 3         24      690            31              5829
# 4         25     1698            31              6942
# 5         25     1755            44              9725
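
As a possible alternative, dplyr's filtering joins express "keep the rows of one table whose key combination occurs in another" directly. The sketch below is not part of the original answer: it uses small toy tibbles with made-up values rather than the asker's data, and the names kept and dropped are only illustrative. semi_join() returns the matching rows of its first argument without adding columns or duplicating rows, so no distinct() is needed, and anti_join() lists the rows that would be removed.

```r
library(dplyr)

# Toy data (hypothetical values, much smaller than the real datasets)
ModuleViews <- tibble(
  session_id    = c(19L, 25L, 25L, 25L),
  moduleid      = c(397L, 1697L, 1698L, 1755L),
  numslidesread = c(1L, 2L, 31L, 44L)
)
PageViews <- tibble(
  session_id          = c(19L, 25L, 25L, 25L),
  moduleid            = c(397L, 1698L, 1698L, 1755L),
  secondsspentonslide = c(5L, 21L, 8L, 19L)
)

keys <- c("session_id", "moduleid")

# Rows of ModuleViews whose (session_id, moduleid) pair occurs in PageViews
kept <- semi_join(ModuleViews, PageViews, by = keys)

# Rows of ModuleViews whose pair does not occur in PageViews
dropped <- anti_join(ModuleViews, PageViews, by = keys)
```

Inspecting dropped before overwriting ModuleViews is a cheap way to check that only the intended session/module combinations disappear.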

Or with base R, for the same result:

result = merge(
  ModuleViews,
  unique(PageViews[c("session_id", "moduleid")])
)
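
To illustrate why the %in% attempts in the question keep the unwanted row, here is a minimal base R sketch on toy data (hypothetical values, chosen so that module 1697 still occurs under a different session). Each %in% test only asks whether a value occurs anywhere in the other column, so a row passes whenever its session and its module each appear somewhere, not necessarily together. Building one combined key per row is one way to test the pair; the names key_mv, key_pv and filtered are illustrative only. The && variant in the question cannot work for a different reason: && compares single values rather than whole vectors, and recent R versions raise an error for longer inputs.

```r
# Toy data (hypothetical values)
ModuleViews <- data.frame(
  session_id = c(25L, 25L, 30L),
  moduleid   = c(1697L, 1698L, 1697L)
)
PageViews <- data.frame(
  session_id = c(25L, 30L),
  moduleid   = c(1698L, 1697L)
)

# Independent tests: (25, 1697) passes, because session 25 and module 1697
# each occur somewhere in PageViews, just never in the same row
independent <- ModuleViews$session_id %in% PageViews$session_id &
  ModuleViews$moduleid %in% PageViews$moduleid

# Paired test: one key per row, so only exact combinations match.
# The "_" separator keeps e.g. (1, 23) and (12, 3) distinct.
key_mv <- paste(ModuleViews$session_id, ModuleViews$moduleid, sep = "_")
key_pv <- paste(PageViews$session_id, PageViews$moduleid, sep = "_")
paired <- key_mv %in% key_pv

# Keeps the original row order and columns of ModuleViews
filtered <- ModuleViews[paired, ]
```

Unlike merge(), this approach leaves the remaining rows in their original order, which may matter if later steps rely on it.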


Tags: r, match