Match rows in different dataframes based on multiple criteria without using for-loops

Question

My data consists of two different dataframes:

visits <- data.frame("visit_nr", "label", "degree", "code")
category <- data.frame("label", "degree", "group", "code1", "code2", "code3")

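I want to assign a group to every visit in the dataframe "visits" based on a match in "label", "degree" and "code" between the two dataframes. However, a row within a certain "visit_nr" can only be assigned to a specific group if "code2" and "code3" from the dataframe "category" are also listed in the dataframe "visits" under that same "visit_nr". This means that to assign a row to a certain group, three rows with the same "visit_nr" are needed in which "label", "degree" and "code" match: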

- "label", "degree", "code1"
- "label", "degree", "code2"
- "label", "degree", "code3"

Because both dataframes contain more than 50 000 rows, I would like to avoid using loops to do this.

Visits

visit_nr   | label | degree | code   |  Group
1601704801 |  171  |    1   | 354373 |   0
1601704801 |  171  |    1   | 200200 |   0
1601704801 |  171  |    1   | 973443 |   0
1601704801 |  171  |    1   | 475985 |   0
1601704801 |  171  |    1   | 994320 |   0

Category

label | degree | group | code1 | code2 | code3
 171  |   1    |   2   | 354373| 200200| 475985 
 171  |   1    |   3   | 354373| 200200| 998282
 171  |   1    |   1   | 354373| 200200| 0

Expected output:

visit_nr   | label | degree | code   |  Group 
1601704801 |  171  |    1   | 354373 |   2
1601704801 |  171  |    1   | 200200 |   2
1601704801 |  171  |    1   | 973443 |   2
1601704801 |  171  |    1   | 475985 |   2
1601704801 |  171  |    1   | 994320 |   2

Answer 1

Merge the two tables three times, once per code column, and then rbind the results, like this:

df1 <- merge(visits, category, by.x = c("label", "degree", "code"), by.y = c("label", "degree", "code1"), all.x = TRUE)
df2 <- merge(visits, category, by.x = c("label", "degree", "code"), by.y = c("label", "degree", "code2"), all.x = TRUE)
df3 <- merge(visits, category, by.x = c("label", "degree", "code"), by.y = c("label", "degree", "code3"), all.x = TRUE)
# rbind() requires identical column names, so rename the leftover code columns
# (e.g. with names(df1) <- ...) before binding
df <- rbind(df1, df2, df3)
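
As a sketch of how the merge-and-rbind idea could be carried through to the group assignment, the following base R code runs on the sample data from the question. Several things here are assumptions rather than part of the answer above: the helper name merge_on, the use of inner joins instead of all.x = TRUE, and the rule that a group counts only when exactly three codes match. lapply() over the three code columns stands in for writing the merge three times; it iterates over column names, not over rows.

```r
# Sample data typed in by hand (same values as in the question)
visits <- data.frame(
  visit_nr = 1601704801,
  label = 171L, degree = 1L,
  code = c(354373L, 200200L, 973443L, 475985L, 994320L)
)
category <- data.frame(
  label = 171L, degree = 1L, group = c(2L, 3L, 1L),
  code1 = 354373L, code2 = 200200L,
  code3 = c(475985L, 998282L, 0L)
)

# Hypothetical helper: inner-join visits to category on one code column and
# keep only the columns shared by all three results, so rbind() works
merge_on <- function(code_col) {
  m <- merge(visits, category,
             by.x = c("label", "degree", "code"),
             by.y = c("label", "degree", code_col))
  m[, c("visit_nr", "label", "degree", "code", "group")]
}
matches <- do.call(rbind, lapply(c("code1", "code2", "code3"), merge_on))

# Count matched codes per visit and candidate group
counts <- aggregate(code ~ visit_nr + label + degree + group,
                    data = matches, FUN = length)

# Assumption: a group is assigned only when all three codes were found
full <- counts[counts$code == 3L, c("visit_nr", "label", "degree", "group")]
result <- merge(visits, full, all.x = TRUE)
result
```

On the sample data only group 2 has all three of its codes among the visit's rows, so every row of result receives group 2. Note that merge() may reorder the rows.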

Answer 2

An alternative approach is to reshape category from wide to long format, join it with visits, and count how many matching codes are found:

library(data.table)
# reshape from wide to long format
lcat <- melt(setDT(category), measure.vars = patterns("^code"),
     value.name = "code")
# join and count
tmp <- lcat[setDT(visits), on = .(label, degree, code), nomatch = 0L][
  , .N, by = .(visit_nr, label, degree, group)][
    N == 3L]
tmp[]

     visit_nr label degree group N
1: 1601704801   171      1     2 3

# update join
visits[tmp, on = .(visit_nr, label, degree), Group := group, mult = "first"][]
visits[]

     visit_nr label degree   code Group
1: 1601704801   171      1 354373     2
2: 1601704801   171      1 200200     2
3: 1601704801   171      1 973443     2
4: 1601704801   171      1 475985     2
5: 1601704801   171      1 994320     2
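
To make the reshape step easier to follow, here is a minimal sketch that builds the sample category table by hand (an assumption, in place of the fread() call in the Data section) and inspects the long-format result. Each category row is expanded into one row per code column, which is what lets a single join on code replace three separate merges.

```r
library(data.table)

# Sample category table typed in by hand
category <- data.table(
  label = 171L, degree = 1L, group = c(2L, 3L, 1L),
  code1 = 354373L, code2 = 200200L,
  code3 = c(475985L, 998282L, 0L)
)

# All columns starting with "code" are stacked into a single "code" column;
# the "variable" column records which original column each value came from
lcat <- melt(category, measure.vars = patterns("^code"), value.name = "code")

nrow(lcat)   # 3 category rows x 3 code columns = 9 rows
names(lcat)  # label, degree, group, variable, code
```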

Edit

The OP later disclosed that

not all rows in the columns code2 and code3 in the dataframe category have a value. It also happens that only code1 has a value different from 0 and code2 and code3 have a value of 0. In this case only the first code has to be present within a certain visit_nr to assign the matching group to the entire visit_nr

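So, simply checking for exactly 3 matching codes does work for the sample dataset, but not for the OP's production dataset.

I believe the additional requirement can be met with two modifications: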

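- Remove the rows with code == 0 from the long-format lcat.
- If tmp contains multiple matches, pick the one with the highest N. In case of ties, which.max() picks the first one encountered.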
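
The tie-breaking behaviour of which.max() can be checked in isolation. The vector n below is made up purely for illustration:

```r
# Two entries share the maximum value 3; which.max() returns the position
# of the first one
n <- c(2L, 3L, 3L)
first_max <- which.max(n)
first_max  # 2
```

This means that, when two groups match equally many codes, the result depends on the row order of tmp.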

So, the code becomes:

library(data.table)
lcat <- melt(setDT(category), measure.vars = patterns("^code"),
             value.name = "code")[code != 0]
tmp <- lcat[setDT(visits), on = .(label, degree, code), nomatch = 0L][
  , .N, by = .(visit_nr, label, degree, group)][
    , .SD[which.max(N)], by = .(visit_nr, label, degree)]
visits[tmp, on = .(visit_nr, label, degree), Group := group]
visits[]

     visit_nr label degree   code Group
1: 1601704801   171      1 354373     2
2: 1601704801   171      1 200200     2
3: 1601704801   171      1 973443     2
4: 1601704801   171      1 475985     2
5: 1601704801   171      1 994320     2
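
Picking the highest N does not by itself verify that every non-zero code of a category row was found in the visit. A stricter variant is sketched below. It is not part of the answer above, and it rests on two assumptions: that a group may be assigned only when all of its non-zero codes are present, and that among several fully matched groups the one with the most codes should win. The column name n_required is made up for this sketch.

```r
library(data.table)

# Sample data typed in by hand (same values as in the Data section)
visits <- data.table(
  visit_nr = 1601704801, label = 171L, degree = 1L,
  code = c(354373L, 200200L, 973443L, 475985L, 994320L),
  Group = 0L
)
category <- data.table(
  label = 171L, degree = 1L, group = c(2L, 3L, 1L),
  code1 = 354373L, code2 = 200200L,
  code3 = c(475985L, 998282L, 0L)
)

# Long format without the placeholder zeros
lcat <- melt(category, measure.vars = patterns("^code"),
             value.name = "code")[code != 0]

# Number of non-zero codes each category row requires (hypothetical column)
lcat[, n_required := .N, by = .(label, degree, group)]

tmp <- lcat[visits, on = .(label, degree, code), nomatch = 0L][
  , .(N = .N, n_required = n_required[1L]),
  by = .(visit_nr, label, degree, group)][
    N == n_required][                      # keep fully matched groups only
      order(-N), .SD[1L], by = .(visit_nr, label, degree)]  # most codes wins

# Update join, as in the answer above
visits[tmp, on = .(visit_nr, label, degree), Group := group]
visits[]
```

On the sample data, groups 2 and 1 are both fully matched (3 of 3 and 2 of 2 codes), while group 3 is dropped because 998282 is missing. Group 2 wins because it matches more codes.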

Data

library(data.table)

visits <- fread("
visit_nr   | label | degree | code   |  Group
1601704801 |  171  |    1   | 354373 |   0
1601704801 |  171  |    1   | 200200 |   0
1601704801 |  171  |    1   | 973443 |   0
1601704801 |  171  |    1   | 475985 |   0
1601704801 |  171  |    1   | 994320 |   0
")

category <- fread("
label | degree | group | code1 | code2 | code3
 171  |   1    |   2   | 354373| 200200| 475985 
 171  |   1    |   3   | 354373| 200200| 998282
 171  |   1    |   1   | 354373| 200200| 0
")
