Keyword search from a separate data frame

Question

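I have some R code below with examples of the two data frames I'm working with. The "keywords" df is going to change regularly, so I need some code that flags the rows in "mydata" where the Segment matches and where mydata$Acct_Name contains the word from keywords$KEYWORD somewhere in the cell.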

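I started writing a FOR loop, but things get hairy quickly when you're dealing with grepl and multiple data frames. My next thought was to parse out mydata$Acct_Name and then try a merge between the two dfs.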

Any help is much appreciated!

Dummy data

Acct_Name <- c('joes ski shop'
               ,'joes alarm shop'
               ,'joes alarm spot'
               ,'joes bakery'
               ,'joes albergue shop'
               ,'jims Brewery'
               ,'jims albergue place'
               )
Segment <- c('All_Other'
             ,'All_Other'
             ,'All_Other'
             ,'All_Other'
             ,'Apartments'
             ,'Apartments'
             ,'Apartments'
             )

mydata <- data.frame(Acct_Name, Segment)

mydata$Acct_Name <- as.character(mydata$Acct_Name)
mydata$Segment <- as.character(mydata$Segment)


Segment <- c('All_Other'
             ,'All_Other'
             ,'All_Other'
             ,'Apartments'
             ,'Apartments'
             ,'Apartments'
             ,'Apartments'
)
KEYWORD <- c('aislamiento'
             ,'alarm'
             ,'alarma'
             ,'albergue'
             ,'alcantarilla cloaca'
             ,'alcohol'
             ,'almacenamiento'
)

keywords <- data.frame(Segment,KEYWORD)
keywords$FLAG <- 1
keywords$Segment <- as.character(keywords$Segment)
keywords$KEYWORD <- as.character(keywords$KEYWORD)

Answer 1

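You want to find any of a group's keywords in the mydata entries for that same group. We can collapse each group's keywords into a single "or" condition by using paste with collapse = "|". Then join the result to mydata and use grepl to create a new flag column. Using data.table: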

library(data.table)
# make the conditions, collapsing by group
kwords <- as.data.table(keywords)[, KWORD := paste(KEYWORD, collapse = "|"), by = Segment
  ][, .SD[1], by = Segment, .SDcols = c("KWORD")]

# make a column based on the grepl with condition
mydata <- as.data.table(mydata)
kwords[mydata, on = "Segment"][, flag := grepl(KWORD, Acct_Name), by = Acct_Name][]

# output:
# Segment                                               KWORD           Acct_Name  flag
# 1:  All_Other                            aislamiento|alarm|alarma       joes ski shop FALSE
# 2:  All_Other                            aislamiento|alarm|alarma     joes alarm shop  TRUE
# 3:  All_Other                            aislamiento|alarm|alarma     joes alarm spot  TRUE
# 4:  All_Other                            aislamiento|alarm|alarma         joes bakery FALSE
# 5: Apartments albergue|alcantarilla cloaca|alcohol|almacenamiento  joes albergue shop  TRUE
# 6: Apartments albergue|alcantarilla cloaca|alcohol|almacenamiento        jims Brewery FALSE
# 7: Apartments albergue|alcantarilla cloaca|alcohol|almacenamiento jims albergue place  TRUE
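For readers who would rather not depend on data.table, the same collapse-then-grepl idea can be sketched in base R. This is only an illustration, not part of the original answer: it rebuilds the dummy data in compact form so the block runs on its own, and the names patterns, row_patterns and flag are made up here.

```r
# Dummy data from the question, rebuilt compactly so this block runs on its own
mydata <- data.frame(
  Acct_Name = c("joes ski shop", "joes alarm shop", "joes alarm spot", "joes bakery",
                "joes albergue shop", "jims Brewery", "jims albergue place"),
  Segment = c(rep("All_Other", 4), rep("Apartments", 3)),
  stringsAsFactors = FALSE
)
keywords <- data.frame(
  Segment = c(rep("All_Other", 3), rep("Apartments", 4)),
  KEYWORD = c("aislamiento", "alarm", "alarma", "albergue",
              "alcantarilla cloaca", "alcohol", "almacenamiento"),
  stringsAsFactors = FALSE
)

# One alternation pattern per Segment, e.g. "aislamiento|alarm|alarma"
patterns <- tapply(keywords$KEYWORD, keywords$Segment, paste, collapse = "|")

# Look up each row's pattern by its Segment, then test each name against its own pattern
row_patterns <- patterns[mydata$Segment]
flag <- unname(mapply(grepl, row_patterns, mydata$Acct_Name))

flag
# FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
```

A Segment in mydata that has no keywords would get an NA pattern here, and grepl then returns NA for that row rather than FALSE.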

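Edit: Another option, which may work better when there are many keywords per group, is stringr::str_detect, which is vectorised over the pattern argument. Like this: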

as.data.table(mydata)[, flag := any(
  stringr::str_detect(Acct_Name, 
                      keywords[keywords$Segment == Segment,"KEYWORD"])), 
  by = Acct_Name][]

# Acct_Name    Segment  flag
# 1:       joes ski shop  All_Other FALSE
# 2:     joes alarm shop  All_Other  TRUE
# 3:     joes alarm spot  All_Other  TRUE
# 4:         joes bakery  All_Other FALSE
# 5:  joes albergue shop Apartments  TRUE
# 6:        jims Brewery Apartments FALSE
# 7: jims albergue place Apartments  TRUE

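For the subset of keywords where keywords$Segment == mydata$Segment, we want to see whether any of the keywords$KEYWORD patterns can be found in mydata$Acct_Name with str_detect. This solution seems a bit quirky to me, since it mixes several ways of referring to columns and mixes data.frame with data.table, but it seems to work. It may be adequate for the size of the original data.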
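To make "vectorised over the pattern" concrete, here is a minimal sketch for a single account name. It assumes the stringr package is installed; the variable names are only illustrative.

```r
library(stringr)

# Keywords for the All_Other segment in the dummy data
kw <- c("aislamiento", "alarm", "alarma")

# One string, several patterns: str_detect returns one result per pattern
hits <- str_detect("joes alarm shop", kw)
hits
# FALSE  TRUE FALSE

# any() reduces the per-keyword results to a single flag for the row
any(hits)
# TRUE
```

Note that "alarma" does not match, because the name contains "alarm" followed by a space.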

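Alternatively, instead of repeating the subsetting for every row, build a list beforehand and use that (split divides the data.frame by Segment, which is equivalent to the subsetting here, and lapply keeps only the KEYWORD column):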

keywords.list <- lapply(split(keywords, keywords$Segment), function(x) x$KEYWORD)

Then refer to each conveniently named list item by the value of Segment in mydata:

as.data.table(mydata)[, flag := any(
  stringr::str_detect(
    Acct_Name, 
    keywords.list[[Segment]])), by = Acct_Name][]

# Acct_Name    Segment  flag
# 1:       joes ski shop  All_Other FALSE
# 2:     joes alarm shop  All_Other  TRUE
# 3:     joes alarm spot  All_Other  TRUE
# 4:         joes bakery  All_Other FALSE
# 5:  joes albergue shop Apartments  TRUE
# 6:        jims Brewery Apartments FALSE
# 7: jims albergue place Apartments  TRUE
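Since the keywords table is expected to change regularly, it may be convenient to wrap the list-based lookup in a function that can be re-run whenever the table is updated. The sketch below uses only base R; flag_keywords is a hypothetical helper name, not something from the original answer, and the small data frames are made up for the demonstration.

```r
# Hypothetical helper: flag rows of `data` whose Acct_Name contains
# any keyword listed for the same Segment in `keywords`.
flag_keywords <- function(data, keywords) {
  # Named list: one character vector of keywords per Segment
  kw_list <- split(keywords$KEYWORD, keywords$Segment)
  vapply(seq_len(nrow(data)), function(i) {
    kws <- kw_list[[as.character(data$Segment[i])]]
    # A Segment with no keywords yields NULL, so nothing can match
    if (is.null(kws)) return(FALSE)
    # Each keyword is used as a regular expression, as in the grepl approach
    any(vapply(kws, grepl, logical(1), x = data$Acct_Name[i]))
  }, logical(1))
}

# Small demonstration with made-up rows
demo_data <- data.frame(
  Acct_Name = c("joes alarm shop", "joes bakery", "jims albergue place", "bobs garage"),
  Segment = c("All_Other", "All_Other", "Apartments", "Unlisted"),
  stringsAsFactors = FALSE
)
demo_keywords <- data.frame(
  Segment = c("All_Other", "All_Other", "Apartments"),
  KEYWORD = c("alarm", "alarma", "albergue"),
  stringsAsFactors = FALSE
)

demo_flags <- flag_keywords(demo_data, demo_keywords)
demo_flags
# TRUE FALSE  TRUE FALSE
```

Unlike the collapsed-pattern version, this returns FALSE rather than NA for a Segment that has no keywords, which may or may not be the behaviour you want.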
