Looping grepl() through data.table (R)
I have a dataset stored as a data.table DT, which looks like this:
print(DT)
             category industry
1:     administration    admin
2: nurse practitioner    truck
3:           trucking    truck
4:     administration    admin
5:        warehousing    nurse
6:        warehousing    admin
7:           trucking    truck
8: nurse practitioner    nurse
9: nurse practitioner    truck
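I want to reduce the table to the rows where the industry matches the category. My general approach is to use grepl() to match the regex string '^{{IND}}[a-z ]+$' against each row of DT$category, using infuse() to insert the corresponding row of DT$industry in place of {{IND}} in the regex string. I struggled to find a slick data.table solution that would loop through the table correctly and make the within-row comparison, so I resorted to a for loop to get the job done: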
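library(data.table)
library(infuser)  # provides infuse()
template <- "^{{IND}}[a-z ]+$"
DT[, match := FALSE]
for (i in seq_len(nrow(DT))) {
  ind <- DT$industry[i]
  categ <- DT$category[i]
  # flag the row when its category matches the regex built from its industry
  if (grepl(infuse(template, IND = ind), categ)) {
    set(DT, i, "match", TRUE)
  }
}
DT <- DT[match == TRUE]
print(DT)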
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse
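However, I am sure this can be done in a better way. Any suggestions on how to get this result using the functionality of the data.table package? As I understand it, using the package's own methods is likely to be more efficient than a for loop in this case.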
You can use stringi::stri_detect_fixed(). It is vectorized over both str and pattern.
DT[stringi::stri_detect_fixed(category, industry)]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse
Alternatively, you can use stringr::str_detect(). It is also vectorized over both of its arguments.
library(stringr)
DT[str_detect(category, fixed(industry))]
Or a base R option is to run grepl() through mapply().
DT[mapply(grepl, industry, category, fixed = TRUE)]
Or you can use Vectorize(grepl) to get the same result.
DT[Vectorize(grepl)(industry, category, fixed = TRUE)]
All of these give the same result.
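The reason for wrapping grepl() in mapply() or Vectorize() is that grepl() is vectorized over the strings being searched but not over pattern: when given several patterns it uses only the first one (with a warning). The following is a minimal base R sketch of that difference; the three-row vectors are made-up illustration data, not the DT from the question.

```r
# Illustration data (hypothetical, not the original DT)
industry <- c("admin", "truck", "nurse")
category <- c("administration", "nurse practitioner", "nurse practitioner")

# What a plain grepl(industry, category) effectively computes:
# only the first pattern ("admin") is tested against every string
first_only <- grepl(industry[1], category, fixed = TRUE)

# Row-wise comparison: pattern i is tested against string i
rowwise <- mapply(grepl, industry, category, fixed = TRUE, USE.NAMES = FALSE)

print(first_only)
print(rowwise)
```

Only the row-wise version reports the third pair ("nurse" in "nurse practitioner") as a match.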
Data:
DT <- structure(list(category = c("administration", "nurse practitioner",
"trucking", "administration", "warehousing", "warehousing", "trucking",
"nurse practitioner", "nurse practitioner"), industry = c("admin",
"truck", "truck", "admin", "nurse", "admin", "truck", "nurse",
"truck")), .Names = c("category", "industry"), class = "data.frame", row.names = c(NA,
-9L))
setDT(DT)
This works fine as long as the match is always anchored at the start of the category string:
DT[substring(category, 1, nchar(industry)) == industry]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse
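A prefix test like this is not the same as the fixed-substring checks, which match anywhere in the string. The sketch below uses base R's startsWith() (available from R 3.3.0, and vectorized over both arguments) as an assumed alternative to the substring() comparison; the "long-haul trucking" value is a hypothetical row added only to show where the two approaches disagree.

```r
# Illustration data; "long-haul trucking" is a made-up value
category <- c("administration", "nurse practitioner", "long-haul trucking")
industry <- c("admin", "truck", "truck")

# Substring anywhere in the string (what the fixed-pattern checks do)
anywhere <- mapply(grepl, industry, category, fixed = TRUE, USE.NAMES = FALSE)

# Match only at the start of the string
prefix <- startsWith(category, industry)

print(anywhere)
print(prefix)
```

With the DT from the question, DT[startsWith(category, industry)] should select the same rows as the substring() comparison, since both test for a prefix.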
data.table is good at grouped operations; I think that is where it can help here, assuming you have many rows that share the same industry:
DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]
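This uses the current idiom for subsetting by group, thanks to @eddi.
Comments. These might help further:
If you have many rows with the same industry-category combination, try by=.(industry,category).
Try something else in place of grep (e.g. the options in Ken's and Richard's answers).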
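To make the grouped subset easier to follow, here is a small sketch of how it works. Inside each by group, industry is a single value, so it can be passed as the pattern, and .I maps the within-group hits back to row numbers of the full table. The four-row table is made-up illustration data, and using grepl() with fixed = TRUE in place of grep() is my own substitution.

```r
library(data.table)

# Illustration data (hypothetical, smaller than the original DT)
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking", "warehousing"),
  industry = c("admin", "truck", "truck", "admin")
)

# One pattern per group; .I holds the row numbers of DT belonging to the group
idx <- DT[, .I[grepl(industry, category, fixed = TRUE)], by = industry]$V1

# Sort to restore the original row order before subsetting
res <- DT[sort(idx)]
print(res)
```

Note that the row numbers come back grouped by industry rather than in the original row order, which is why the sketch sorts them before subsetting.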