Count 1st instance of keyword in list with no duplicate counts in R

Question

I have a list of keywords:

library(stringr)
words <- as.character(c("decomposed", "no diagnosis","decomposition","autolysed","maggots", "poor body", "poor","not suitable", "not possible"))

I want to match these keywords against the text in a data frame column (df$text) and store the number of times each keyword occurs in a separate data.frame (matchdf):

matchdf<- data.frame(Keywords=words)
m_match<-sapply(1:length(words), function(x) sum(str_count(tolower(df$text),words[[x]])))
matchdf$matchs<-m_match

However, I noticed that this method counts every occurrence of a keyword in the column. For example, the field

"The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time"

returns a count of 2. However, I only want to count the first instance of "decomposed" in each field.

I thought there would be a way to count only the first instance with str_count, but there doesn't seem to be one.

Answer 1

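stringr isn't strictly necessary for this example; grepl from base R is sufficient. Both grepl and str_detect report only whether a keyword is present in a field, not how many times it appears, so repeated occurrences within one field are not counted twice. That said, if you prefer the package function, use str_detect instead of grepl (as @Chi-Pak pointed out in the comments).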

library(stringr)

words <- c("decomposed", "no diagnosis","decomposition","autolysed","maggots", 
           "poor body", "poor","not suitable", "not possible")

df <- data.frame( text = "The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time")

matchdf <- data.frame(Keywords = words, stringsAsFactors = FALSE)

# Base R: grepl() returns TRUE/FALSE per row, so sum() counts each row at most once
matchdf$matches1 <- sapply(seq_along(words), function(x) sum(grepl(words[x], tolower(df$text))))

# stringr equivalent: str_detect() also returns TRUE/FALSE per row
matchdf$matches2 <- sapply(seq_along(words), function(x) sum(str_detect(tolower(df$text), words[[x]])))

matchdf

Result

       Keywords matches1 matches2
1    decomposed        1        1
2  no diagnosis        0        0
3 decomposition        0        0
4     autolysed        0        0
5       maggots        0        0
6     poor body        0        0
7          poor        0        0
8  not suitable        0        0
9  not possible        0        0
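
The example above has a single row, so it may help to see the same idea on several rows. The following is a minimal base R sketch, not part of the original answer: the three text rows are made up for illustration, `count_rows_with` is a hypothetical helper name, and `fixed = TRUE` is an assumption that the keywords should be treated as literal strings rather than regular expressions.

```r
# Hypothetical multi-row data; these rows are invented for illustration
df <- data.frame(
  text = c(
    "The sample was too decomposed to perform an analysis. The decomposed sample was old.",
    "Decomposed carcass with maggots present.",
    "Poor body condition, no diagnosis."
  ),
  stringsAsFactors = FALSE
)

words <- c("decomposed", "no diagnosis", "maggots", "poor body", "poor")

# grepl() gives one TRUE/FALSE per row, so a row adds at most 1 to a
# keyword's total, no matter how often the keyword repeats in that row.
# fixed = TRUE (an assumption) matches the keyword as a literal string.
count_rows_with <- function(keyword, text) {
  sum(grepl(keyword, tolower(text), fixed = TRUE))
}

matchdf <- data.frame(
  Keywords = words,
  matches = vapply(words, count_rows_with, integer(1),
                   text = df$text, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

matchdf
```

Here "decomposed" gets a count of 2 (rows 1 and 2), even though it appears twice in row 1. Note that matching is by substring, so "poor" is also counted in any row that contains "poor body".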


r

web-scraping

stringr