Count 1st instance of keyword in list with no duplicate counts in R

I have a list of keywords:

library(stringr)
words <- as.character(c("decomposed", "no diagnosis","decomposition","autolysed","maggots", "poor body", "poor","not suitable", "not possible"))

I want to match these keywords against the text in a data frame column (df$text) and record how many times each keyword occurs in a separate data.frame (matchdf):

matchdf<- data.frame(Keywords=words)
m_match<-sapply(1:length(words), function(x) sum(str_count(tolower(df$text),words[[x]])))
matchdf$matchs<-m_match

However, I noticed that this method counts every occurrence of a keyword within the column. For example, the text

"The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time"

returns a count of 2. However, I only want to count the first instance of "decomposed" within a field.

I thought there would be a way to count only the first instance using str_count, but there doesn't seem to be one.

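stringr isn't strictly necessary for this; grepl from base R is enough. That said, if you prefer the package's functions, use str_detect instead of grepl (as @Chi-Pak pointed out in the comments). Both return TRUE or FALSE for each string rather than a count of occurrences, so a keyword that appears several times in one field is counted only once.
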
library(stringr)

words <- c("decomposed", "no diagnosis","decomposition","autolysed","maggots", 
           "poor body", "poor","not suitable", "not possible")

df <- data.frame( text = "The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time")

matchdf <- data.frame(Keywords = words, stringsAsFactors = FALSE)

# Base R: grepl() gives TRUE/FALSE per row, so sum() counts each row at most once
matchdf$matches1 <- sapply(seq_along(words), function(x) sum(grepl(words[x], tolower(df$text))))

# stringr equivalent: str_detect() also flags each row at most once
matchdf$matches2 <- sapply(seq_along(words), function(x) sum(str_detect(tolower(df$text), words[[x]])))

matchdf

Result

       Keywords matches1 matches2
1    decomposed        1        1
2  no diagnosis        0        0
3 decomposition        0        0
4     autolysed        0        0
5       maggots        0        0
6     poor body        0        0
7          poor        0        0
8  not suitable        0        0
9  not possible        0        0
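
The example above has only one row, so it may help to see how the same idea behaves across several rows. The following is a minimal sketch, not part of the original answer: the data frame df_demo, the keyword vector words_demo, and the sample sentences are all made up for illustration, and it uses base R grepl with fixed = TRUE on the assumption that the keywords are literal strings rather than regular expressions.

```r
# Hypothetical multi-row data, invented for illustration only
df_demo <- data.frame(
  text = c(
    "The sample was too decomposed to perform an analysis. The decomposed sample was old.",
    "Carcass decomposed; maggots present.",
    "No diagnosis was possible.",
    "Poor body condition."
  ),
  stringsAsFactors = FALSE
)

words_demo <- c("decomposed", "no diagnosis", "maggots", "poor")

# For each keyword, count the rows that contain it at least once.
# grepl() returns one TRUE/FALSE per row, so repeats within a row are ignored.
counts <- vapply(
  words_demo,
  function(w) sum(grepl(w, tolower(df_demo$text), fixed = TRUE)),
  integer(1)
)

matchdf_demo <- data.frame(
  Keywords = words_demo,
  matches = unname(counts),
  stringsAsFactors = FALSE
)
matchdf_demo
```

Here "decomposed" gets a count of 2, because it appears in two rows, even though it occurs three times in total; the other three keywords each get 1.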