Match rows in different dataframes based on multiple criteria without using for-loops
My data consists of two different data frames:
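# schematic only: empty data frames that show the column names
visits <- data.frame(visit_nr = numeric(), label = integer(), degree = integer(), code = integer())
category <- data.frame(label = integer(), degree = integer(), group = integer(), code1 = integer(), code2 = integer(), code3 = integer())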
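I want to assign a group to each visit in the dataframe "visits" based on a match in "label", "degree" and "code" between the two dataframes.
However, a row within a certain "visit_nr" can only be assigned to a specific group if "code2" and "code3" in the dataframe "category" are also listed in the dataframe "visits" under the same "visit_nr". This means that to assign a row to a certain group, three rows with the same "visit_nr" are needed, in which "label", "degree" and "code" match with: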
- "label", "degree", "code1"
- "label", "degree", "code2"
- "label", "degree", "code3"
Because both data frames contain more than 50 000 rows, I would like to avoid using loops for this.
visits
visit_nr | label | degree | code | Group
1601704801 | 171 | 1 | 354373 | 0
1601704801 | 171 | 1 | 200200 | 0
1601704801 | 171 | 1 | 973443 | 0
1601704801 | 171 | 1 | 475985 | 0
1601704801 | 171 | 1 | 994320 | 0
category
label | degree | group | code1 | code2 | code3
171 | 1 | 2 | 354373| 200200| 475985
171 | 1 | 3 | 354373| 200200| 998282
171 | 1 | 1 | 354373| 200200| 0
Expected output:
visit_nr | label | degree | code | Group
1601704801 | 171 | 1 | 354373 | 2
1601704801 | 171 | 1 | 200200 | 2
1601704801 | 171 | 1 | 973443 | 2
1601704801 | 171 | 1 | 475985 | 2
1601704801 | 171 | 1 | 994320 | 2
Merge the 2 tables 3 times, then rbind them like this:
df1 <- merge(visits, category, by.x = c("label", "degree", "code"), by.y = c("label", "degree", "code1"), all.x = TRUE)
df2 <- merge(visits, category, by.x = c("label", "degree", "code"), by.y = c("label", "degree", "code2"), all.x = TRUE)
df3 <- merge(visits, category, by.x = c("label", "degree", "code"), by.y = c("label", "degree", "code3"), all.x = TRUE)
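# the three results carry different leftover code columns, so keep only
# the columns they share before stacking them with rbind()
keep <- c("visit_nr", "label", "degree", "code", "group")
df <- rbind(df1[keep], df2[keep], df3[keep])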
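The stacked merge result still has to be reduced to one group per visit. The following is a minimal base R sketch of that step, not part of the original answer. It assumes the strict rule from the question (all three codes must be present within the same visit_nr), uses inner merges instead of all.x = TRUE, and works on the sample data; the names keep, hits, n_codes, full_match and key are illustrative only.

```r
# Sample data from the question
visits <- data.frame(
  visit_nr = 1601704801, label = 171L, degree = 1L,
  code = c(354373L, 200200L, 973443L, 475985L, 994320L)
)
category <- data.frame(
  label = 171L, degree = 1L, group = c(2L, 3L, 1L),
  code1 = 354373L, code2 = 200200L, code3 = c(475985L, 998282L, 0L)
)

# One inner merge per code column, reduced to the shared columns
keep <- c("visit_nr", "label", "degree", "code", "group")
hits <- do.call(rbind, lapply(c("code1", "code2", "code3"), function(col) {
  out <- merge(visits, category,
               by.x = c("label", "degree", "code"),
               by.y = c("label", "degree", col))
  out[keep]
}))

# Count the matched codes per visit and candidate group
n_codes <- aggregate(code ~ visit_nr + label + degree + group,
                     data = hits, FUN = length)
names(n_codes)[names(n_codes) == "code"] <- "n"

# Assumption: a group qualifies only if all three codes were found
full_match <- n_codes[n_codes$n == 3, ]

# Write the group back to every row of the matching visit
key <- function(d) paste(d$visit_nr, d$label, d$degree)
visits$Group <- full_match$group[match(key(visits), key(full_match))]
visits
```

Visits without a full match end up with NA in Group. If a visit could fully match more than one group, match() keeps only the first one, so a tie-breaking rule would be needed in that case.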
An alternative approach is to reshape category from wide to long format, join it with visits, and count how many matching codes can be found:
library(data.table)
# reshape from wide to long format
lcat <- melt(setDT(category), measure.vars = patterns("^code"),
value.name = "code")
# join and count
tmp <- lcat[setDT(visits), on = .(label, degree, code), nomatch = 0L][
, .N, by = .(visit_nr, label, degree, group)][
N == 3L]
tmp[]
visit_nr label degree group N
1: 1601704801 171 1 2 3
# update join
visits[tmp, on = .(visit_nr, label, degree), Group := group, mult = "first"][]
visits[]
visit_nr label degree code Group
1: 1601704801 171 1 354373 2
2: 1601704801 171 1 200200 2
3: 1601704801 171 1 973443 2
4: 1601704801 171 1 475985 2
5: 1601704801 171 1 994320 2
Edit
The OP has since disclosed that
not all rows in the columns code2 and code3 in the dataframe category have a value. It also happens that only code1 has a value different from 0 and code2 and code3 have a value of 0. In this case only the first code has to be present within a certain visit_nr to assign the matching group to the entire visit_nr
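So, the simple check for exactly 3 matching codes works for the sample dataset but not for the OP's production dataset.
I believe the additional requirement can be met with two modifications:
- Remove all rows with code == 0 from the long-format table lcat.
- If tmp contains more than one match, pick the one with the highest N. In case of ties, which.max() picks the first one encountered.
So, the code becomes: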
library(data.table)
lcat <- melt(setDT(category), measure.vars = patterns("^code"),
value.name = "code")[code != 0]
tmp <- lcat[setDT(visits), on = .(label, degree, code), nomatch = 0L][
, .N, by = .(visit_nr, label, degree, group)][
, .SD[which.max(N)], by = .(visit_nr, label, degree)]
visits[tmp, on = .(visit_nr, label, degree), Group := group]
visits[]
visit_nr label degree code Group
1: 1601704801 171 1 354373 2
2: 1601704801 171 1 200200 2
3: 1601704801 171 1 973443 2
4: 1601704801 171 1 475985 2
5: 1601704801 171 1 994320 2
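Note that picking the highest N does not check that every non-zero code of the chosen category row is present in the visit. If that is required, a stricter variant could compare the number of matched codes with the number of non-zero codes per category row. The following is only a sketch under that assumption, using the sample data; the names n_required, matched and best are illustrative and not part of the original answer.

```r
library(data.table)

# Sample data from the question
visits <- data.table(
  visit_nr = 1601704801, label = 171L, degree = 1L,
  code = c(354373L, 200200L, 973443L, 475985L, 994320L),
  Group = 0L
)
category <- data.table(
  label = 171L, degree = 1L, group = c(2L, 3L, 1L),
  code1 = 354373L, code2 = 200200L, code3 = c(475985L, 998282L, 0L)
)

# Long format without the 0 placeholders
lcat <- melt(category, measure.vars = patterns("^code"),
             value.name = "code")[code != 0]

# Number of non-zero codes each category row requires
lcat[, n_required := .N, by = .(label, degree, group)]

# Join, count the distinct matched codes per visit and candidate group,
# and keep only the groups whose required codes were all found
matched <- lcat[visits, on = .(label, degree, code), nomatch = 0L][
  , .(N = uniqueN(code), n_required = n_required[1L]),
  by = .(visit_nr, label, degree, group)][
  N == n_required]

# Assumption: if several groups qualify, prefer the one with the most codes
setorder(matched, -N)
best <- unique(matched, by = c("visit_nr", "label", "degree"))

# Update join: write the chosen group to every row of the visit
visits[best, on = .(visit_nr, label, degree), Group := i.group]
visits[]
```

On the sample data, groups 2 and 1 both have all of their non-zero codes present, and group 2 is chosen because it matches more codes. Group 3 is dropped because code 998282 does not occur in the visit.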
Data
library(data.table)
visits <- fread("
visit_nr | label | degree | code | Group
1601704801 | 171 | 1 | 354373 | 0
1601704801 | 171 | 1 | 200200 | 0
1601704801 | 171 | 1 | 973443 | 0
1601704801 | 171 | 1 | 475985 | 0
1601704801 | 171 | 1 | 994320 | 0
")
category <- fread("
label | degree | group | code1 | code2 | code3
171 | 1 | 2 | 354373| 200200| 475985
171 | 1 | 3 | 354373| 200200| 998282
171 | 1 | 1 | 354373| 200200| 0
")