Variable substring match between two columns
I have a dataset of 20,000 rows which, in its simplest form, looks like this:
               v1     v2
1 Case 1 (A v. B) A v. B
2 Case 2 (A v. C) A v. B
3 Case 2 (A v. C) C v. B
4 Case 4 (X v. Z) X v. Z
5 Case 5 (B v. A) A v. B
6 Case 6 (X v. A) X v. A
7 Case 6 (X v. A) A v. X
...
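...except that there are many variations of v1 and v2 (about 150 in practice, which is still too many to list).
I would like to return a third column, v3, containing a logical indicator of whether any substring of v1 matches the string in v2.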
               v1     v2    v3
1 Case 1 (A v. B) A v. B  TRUE
2 Case 2 (A v. C) A v. B FALSE
3 Case 2 (A v. C) C v. B FALSE
4 Case 4 (X v. Z) X v. Z  TRUE
5 Case 5 (B v. A) A v. B FALSE
6 Case 6 (X v. A) X v. A  TRUE
7 Case 6 (X v. A) A v. X FALSE
I have been playing around with something like this, which I think is on the right track:
library(stringr)
x$v3 <- with(x, str_detect(v1, v2))
If anyone can point me in the right direction towards a solution or workaround, I would much appreciate it.
A MWE showing that my str_detect() approach does not work:
x <- structure(list(v1 = c("Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation",
"Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation"
), v2 = c("Georgia v Russian Federation", " Ethiopia v South Africa Liberia v South Africa",
" Cameroon v United Kingdom", " New Zealand v France", " Australia v France",
" Nicaragua v United States of America", " Nicaragua v Honduras",
" Nauru v Anustralia", " Nnew Zealand v France", " Islamic Republic of Iran v United States of America",
" Bosnia and Herzegovina v Serbia and Montenegro", " Spain v Cananda",
" Libyan Arab Jamahiriya v United States of America", " Libyan Arab Jamahiriya v United Kingdom",
" Democratic Republic of the Congo v Burundi", " Germany v United States of America",
" Democratic Republic of the Congo v Belgium", " Liechtenstein v Germany",
" Democratic Republic of the Congo v Ugandan", " Democratic Republic of the Congo v Rwandan",
" Nicaragua v Colombia", " Djibouti v France", " Georgia v Russian Federation",
" Croatia v Serbia", " Mexico v United States of American", " Democratic Republic of the Congo v Rwanda",
" Spain v Canada", " Australia v France", " New Zealand v France",
" New Zealand v France")), .Names = c("v1", "v2"
), row.names = c(NA, 30L), class = "data.frame")
grepl can be used to compare a single value in v2 with the possible substrings of v1.
You need to apply it to each row separately, so a quick solution is:
apply(x[, c("v1", "v2")], MARGIN=1, FUN=function(x) {grepl(x[2],x[1])})
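As a variation on the row-wise idea, here is a minimal sketch that uses mapply instead of apply and asks grepl for a literal match with fixed = TRUE. This is not the code from the answer above; it is one possible alternative, and the toy data frame simply re-creates the simplified example from the question. With fixed = TRUE, characters such as "." and "(" in v2 are treated as plain text rather than as regular expression syntax.

```r
# Toy data mirroring the simplified example in the question.
x <- data.frame(
  v1 = c("Case 1 (A v. B)", "Case 2 (A v. C)", "Case 2 (A v. C)",
         "Case 4 (X v. Z)", "Case 5 (B v. A)", "Case 6 (X v. A)",
         "Case 6 (X v. A)"),
  v2 = c("A v. B", "A v. B", "C v. B", "X v. Z", "A v. B", "X v. A", "A v. X"),
  stringsAsFactors = FALSE
)

# Row-wise literal containment: is v2[i] a substring of v1[i]?
# fixed = TRUE switches off regex interpretation of the pattern.
x$v3 <- mapply(
  function(pattern, text) grepl(pattern, text, fixed = TRUE),
  x$v2, x$v1,
  USE.NAMES = FALSE
)

print(x)
```

On this toy data the v3 column matches the desired output shown in the question. Because mapply walks the two columns in parallel, there is no need to convert the data frame to a character matrix first, as apply does.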
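If you want to ignore differences in the number of spaces (as in row 1), you can use gsub to turn the value in x[2] into a suitable regular expression, so that " " is replaced with " *" to allow multiple spaces.
In that case, this apply call will work: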
apply(x,MARGIN=1, FUN=function(x) {grepl(gsub(" "," *",x[2]),x[1])})
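Another way to handle spacing differences, sketched here as an assumption rather than as part of the original answer, is to normalise whitespace in both columns before matching. The helper name squish is hypothetical, and the two short vectors are made-up stand-ins for the real data. One reason to consider this: the pattern " *" also matches zero spaces, whereas normalising first keeps exactly one space between words and still permits a literal match.

```r
# Hypothetical helper (not from the original answer): trim both ends and
# collapse every run of whitespace into a single space.
squish <- function(s) gsub("\\s+", " ", trimws(s))

# Made-up examples with leading and doubled spaces.
v1 <- c("Application of the Convention  Georgia v Russian Federation",
        "Application of the Convention Georgia v Russian Federation")
v2 <- c(" Georgia v  Russian Federation", " Spain v Canada")

# Compare the normalised strings row by row, as literal text.
v3 <- mapply(
  function(pattern, text) grepl(pattern, text, fixed = TRUE),
  squish(v2), squish(v1),
  USE.NAMES = FALSE
)

print(v3)
```

Here the first pair matches once the extra spaces are collapsed, and the second pair does not. Note that this only addresses spacing; misspelled names in v2 would still fail to match and would need approximate matching or manual cleaning.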