Matching rows that share two or more words across different data frames in R
I have two data frames, DF1 and DF2, like this.
ID = c(1, 2, 3, 4)
Issues = c('Issue1, Issue4', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
DF1 = data.frame(ID, Issues, Location, Customer)
Root_Cause = c('R1', 'R2', 'R3', 'R4')
List_of_Issues = c('Issue1, Issue3, Issue5', 'Issue2, Issue1, Issue4', 'Issue6, Issue7', 'Issue5, Issue6')
DF2 = data.frame(Root_Cause, List_of_Issues)
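I want to compare the "Issues" column of DF1 with the "List_of_Issues" column of DF2. If two or more of the words in a row of "Issues" also appear in a row of "List_of_Issues", I want to fill in the corresponding "Root_Cause" from DF2.
My resulting data frame should look like DF3.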
ID = c(1, 2, 3, 4)
Issues = c('Issue1, Issue4', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
Root_Cause = c('R2', 'R4', NA, 'R1')
DF3 = data.frame(ID, Issues, Location, Customer, Root_Cause)
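To make the matching rule concrete, here is a minimal sketch of the check for a single pair of rows, using ID 1 from DF1 and root causes R1 and R2 from DF2. The variable names `row_issues`, `r1_issues`, and `r2_issues` are made up for illustration.

```r
# Issues for ID 1 in DF1, and the issue lists for R1 and R2 in DF2
row_issues <- c("Issue1", "Issue4")
r1_issues  <- c("Issue1", "Issue3", "Issue5")
r2_issues  <- c("Issue2", "Issue1", "Issue4")

# Words shared between the DF1 row and each root cause
shared_r1 <- intersect(row_issues, r1_issues)  # only "Issue1"
shared_r2 <- intersect(row_issues, r2_issues)  # "Issue1" and "Issue4"

# A root cause qualifies when at least two words are shared
length(shared_r1) >= 2  # FALSE, so R1 is not assigned
length(shared_r2) >= 2  # TRUE, so ID 1 gets R2
```

This is why ID 1 maps to R2 in DF3, while ID 3 shares at most one issue with any root cause and stays NA.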
Using data.table:
Edit: I have edited your sample data to cover the possibility of multiple root causes. In this data, ID==1 corresponds to both R2 and R3.
Data
library(data.table)

ID = c(1, 2, 3, 4)
Issues = c('Issue1, Issue4, Issue6, Issue7', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
DF1 = data.table(ID, Issues, Location, Customer)
Root_Cause = c('R1', 'R2', 'R3', 'R4')
List_of_Issues = c('Issue1, Issue3, Issue5', 'Issue2, Issue1, Issue4', 'Issue6, Issue7', 'Issue5, Issue6')
DF2 = data.table(Root_Cause, List_of_Issues)
Code
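# Split the comma-separated strings into list columns of individual issues
DF1[, Issues := strsplit(Issues, split = ', ')]
DF2[, List_of_Issues := strsplit(List_of_Issues, split = ', ')]

# For each row of DF1, count the issues it shares with every root cause in DF2
DF1[, RootCause := lapply(Issues, function(x){
  matchvec = sapply(DF2[, List_of_Issues], function(y) length(intersect(y, x)))
  ids = which(matchvec > 1)  # root causes sharing at least two issues
  str = DF2[, paste(Root_Cause[ids], collapse = ', ')]
  if (str == '') NA else str  # NA when no root cause qualifies
})]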
Result
> DF1
   ID                      Issues Location Customer RootCause
1:  1 Issue1,Issue4,Issue6,Issue7        x        a    R2, R3
2:  2        Issue2,Issue5,Issue6        y        b        R4
3:  3               Issue3,Issue4        z        c        NA
4:  4               Issue1,Issue5        w        d        R1
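If you would rather stay in base R and keep the original comma-separated columns untouched, the same idea can be sketched without data.table. This is only one possible variant: the helper name `match_root_causes` and its `min_shared` argument are hypothetical, and the data is the original sample from the question, not the edited one.

```r
# Original sample data from the question, as plain data frames
DF1 <- data.frame(
  ID = c(1, 2, 3, 4),
  Issues = c("Issue1, Issue4", "Issue2, Issue5, Issue6",
             "Issue3, Issue4", "Issue1, Issue5"),
  Location = c("x", "y", "z", "w"),
  Customer = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  Root_Cause = c("R1", "R2", "R3", "R4"),
  List_of_Issues = c("Issue1, Issue3, Issue5", "Issue2, Issue1, Issue4",
                     "Issue6, Issue7", "Issue5, Issue6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each element of `issues`, return the root causes
# whose issue list shares at least `min_shared` entries with it
match_root_causes <- function(issues, causes, cause_issues, min_shared = 2) {
  issue_sets <- strsplit(issues, ",\\s*")
  cause_sets <- strsplit(cause_issues, ",\\s*")
  vapply(issue_sets, function(x) {
    # Number of shared issues between this row and every root cause
    n_shared <- vapply(cause_sets, function(y) length(intersect(x, y)), integer(1))
    hits <- causes[n_shared >= min_shared]
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = ", ")
  }, character(1))
}

DF1$Root_Cause <- match_root_causes(DF1$Issues, DF2$Root_Cause, DF2$List_of_Issues)
```

Here `Root_Cause` ends up as an ordinary character column rather than a list column, which may be easier to write out or merge later. Multiple matches are still collapsed into one comma-separated string.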