R partial string matching and return value (in R)
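I have several purchasing databases, and I need to run a list of "keywords" that I have built against them to identify certain products. When a product matches, I want to tag it with a surgery category.
Here is an example.
Purchasing database (the real one has more than 2 million rows):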
d<-data.frame(prod_desc=c("BANDELETTE TVTO-OBTRYX HALO", "BANDELETTE MINI ARC PRECISES", "BANDELETTE D'ANALYSE POUR GLYCEMIE", "DIACH. BANDELETTE STER 19MM X 72MM","SLING MALE SYSTEM","DIACHILON","AIGUILLE","GANT","LABEL","CRAYON"),label=1:10)
Keyword list and return values (the actual list is much longer):
kw<-data.frame(kw=c("bandelette","tvt","bande transvaginale","sling system","argus"),category="ss_bandelette")
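I want to find the products in prod_desc that contain one of my keyword strings in kw and, on a match, add a column to the d data frame holding the category associated with that keyword in the kw data frame.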
Right now I can get the expected result with the following code:
d$match <- ifelse(grepl(paste(kw$kw, collapse = "|"), d$prod_desc, ignore.case = TRUE), "SS_Bandelette", "-")
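But this code is not very efficient, because I have about 350 keywords mapped to about 30 different categories. What code can I use to automatically return the category in the d data frame whenever one of my keywords is triggered?
Thank you very much for your help.
Phil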
# convert all descriptions to lowercase so they match the lowercase keywords
d$prod_desc <- tolower(d$prod_desc)
# create a logical matrix that specifies which keywords are present on each row of 'd'
m <- data.frame(sapply(kw$kw, grepl, d$prod_desc))
colnames(m) <- kw$kw
# create a column in 'd' with the first matching keyword (NA if none matches)
d$kw <- apply(m, 1, function(x) names(x)[which(x)[1]])
# simple merge; all.x = TRUE keeps the rows of 'd' that have no keyword
merge(d, kw, by = "kw", all.x = TRUE)
#           kw                          prod_desc label      category
#1  bandelette bandelette d'analyse pour glycemie     3 ss_bandelette
#2  bandelette diach. bandelette ster 19mm x 72mm     4 ss_bandelette
#3  bandelette        bandelette tvto-obtryx halo     1 ss_bandelette
#4  bandelette       bandelette mini arc precises     2 ss_bandelette
#5        <NA>                  sling male system     5          <NA>
#6        <NA>                          diachilon     6          <NA>
#7        <NA>                           aiguille     7          <NA>
#8        <NA>                               gant     8          <NA>
#9        <NA>                              label     9          <NA>
#10       <NA>                             crayon    10          <NA>
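The example data only has one category, so the lookup step is hard to see in the output. Below is a minimal sketch of the same first-match idea with two categories. The table kw2, the category "ss_sling" and the shortened desc vector are made up for illustration and are not part of the question's data. It assumes keywords are plain substrings (hence fixed = TRUE) and that there is more than one description, since vapply only returns a matrix in that case.

```r
# Hypothetical keyword table with two categories ("ss_sling" is invented here)
kw2 <- data.frame(
  kw = c("bandelette", "tvt", "sling", "argus"),
  category = c("ss_bandelette", "ss_bandelette", "ss_sling", "ss_sling"),
  stringsAsFactors = FALSE
)
desc <- tolower(c("BANDELETTE TVTO-OBTRYX HALO", "SLING MALE SYSTEM", "GANT"))

# Logical matrix: one row per description, one column per keyword.
# fixed = TRUE treats each keyword as a literal substring, not a regex.
hits <- vapply(kw2$kw, function(k) grepl(k, desc, fixed = TRUE),
               logical(length(desc)))

# Index of the first matching keyword per row, or NA when nothing matches
first_hit <- apply(hits, 1, function(r) if (any(r)) which(r)[1] else NA_integer_)

# Look the category up directly by keyword index instead of merging
res <- data.frame(prod_desc = desc,
                  category = kw2$category[first_hit],
                  stringsAsFactors = FALSE)
```

Indexing kw2$category by position keeps the original row order of the descriptions, which merge() does not.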
# Create dataframe as per original question
d<-data.frame(prod_desc=c("BANDELETTE TVTO-OBTRYX HALO", "BANDELETTE MINI ARC PRECISES", "BANDELETTE D'ANALYSE POUR GLYCEMIE", "DIACH. BANDELETTE STER 19MM X 72MM","SLING MALE SYSTEM","DIACHILON","AIGUILLE","GANT","LABEL","CRAYON"),label=1:10)
# Create keywords as per original question
kw<-data.frame(kw=c("bandelette","tvt","bande transvaginale","sling system","argus"),category="ss_bandelette")
# Assume you want to match on word boundaries; otherwise "BANDELETTE TVTO-OBTRYX HALO" would also match "tvt", for instance.
# The backslash must be doubled in an R string: "\b" on its own is a backspace character.
kw$kw <- paste0("\\b", kw$kw, "\\b")
x <- sapply(kw$kw, function(x) grepl(x, d$prod_desc, ignore.case = TRUE))
# Keep the first matching keyword per row (NA when nothing matches)
d$Match <- apply(x, 1, function(i) names(i)[which(i)[1]])
d$Match <- kw$category[match(d$Match, kw$kw)]
d
#                             prod_desc label         Match
# 1         BANDELETTE TVTO-OBTRYX HALO     1 ss_bandelette
# 2        BANDELETTE MINI ARC PRECISES     2 ss_bandelette
# 3  BANDELETTE D'ANALYSE POUR GLYCEMIE     3 ss_bandelette
# 4  DIACH. BANDELETTE STER 19MM X 72MM     4 ss_bandelette
# 5                   SLING MALE SYSTEM     5          <NA>
# 6                           DIACHILON     6          <NA>
# 7                            AIGUILLE     7          <NA>
# 8                                GANT     8          <NA>
# 9                               LABEL     9          <NA>
# 10                             CRAYON    10          <NA>
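With about 350 keywords and 2 million rows, one grepl call per keyword may be slow. One possible variation, sketched below, is to collapse the keywords of each category into a single alternation pattern, so that about 30 passes over the data are needed instead of 350. The table kw2, the category "ss_sling" and the short desc vector are hypothetical. The sketch also assumes that keywords contain no regex metacharacters and that, when several categories match, the first one in alphabetical order should win.

```r
# Hypothetical keyword table; "ss_sling" is an invented second category
kw2 <- data.frame(
  kw = c("bandelette", "tvt", "sling", "argus"),
  category = c("ss_bandelette", "ss_bandelette", "ss_sling", "ss_sling"),
  stringsAsFactors = FALSE
)
desc <- c("BANDELETTE TVTO-OBTRYX HALO", "SLING MALE SYSTEM", "GANT")

# One alternation pattern per category, e.g. "\\b(bandelette|tvt)\\b"
patterns <- vapply(
  split(kw2$kw, kw2$category),
  function(k) paste0("\\b(", paste(k, collapse = "|"), ")\\b"),
  character(1)
)

# Start with NA and fill in each category only where no earlier one matched
category <- rep(NA_character_, length(desc))
for (cat_name in names(patterns)) {
  hit <- is.na(category) &
    grepl(patterns[[cat_name]], desc, ignore.case = TRUE, perl = TRUE)
  category[hit] <- cat_name
}
```

Because of the word boundaries, "tvt" does not match "TVTO" here; the first description is tagged through "bandelette" instead.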