Counting whole word/number occurrences with str_count in R

Question

As in this question, I want to use str_count from the stringr package to count how many times several words and numbers occur in a vector of sentences.

However, I noticed that it counts not only whole numbers but also parts of numbers. For example:

df <- c("honda civic 1988 with new lights","toyota auris 4x4 140000 km","nissan skyline 2.0 159000 km")
keywords <- c("honda","civic","toyota","auris","nissan","skyline","1988","1400","159")
library(stringr)
number_of_keywords_df <- str_count(df, paste(keywords, collapse='|'))

Here number_of_keywords_df comes back as 3, 3, 3, whereas it clearly should be 3, 2, 2. str_count seems to count partial matches: "1400" inside the number "140000" and "159" inside "159000". Is there any way to avoid this?

Answer 1

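Try placing word boundaries around your keywords, then collapse them with '|' and pass the result to str_count as before: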

keywords <- c("honda","civic","toyota","auris","nissan","skyline","1988","1400","159")
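# Double the backslash: in an R string "\b" is a backspace character, while "\\b" is the regex word boundary
keywords <- paste0("\\b", keywords, "\\b")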

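In regex terms, \bhonda\b (written "\\bhonda\\b" in an R string) matches honda only as a standalone word. So hondas would not match, because it has an extra letter at the end. For the same reason, \b1400\b no longer matches inside 140000: the digit that follows 1400 is a word character, so there is no boundary at that position.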
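To make the effect of the boundaries concrete, here is a minimal sketch that first checks the honda/hondas case and then applies the bounded keywords to the data from the question. The names pattern and counts, and the two short test strings, are illustrative and not part of the original answer.

```r
library(stringr)

# "\\b" in an R string is the regex word boundary \b.
# "honda" matches as a whole word; "hondas" does not.
str_count(c("honda civic", "two hondas"), "\\bhonda\\b")
# [1] 1 0

df <- c("honda civic 1988 with new lights",
        "toyota auris 4x4 140000 km",
        "nissan skyline 2.0 159000 km")
keywords <- c("honda", "civic", "toyota", "auris", "nissan", "skyline",
              "1988", "1400", "159")

# Wrap every keyword in word boundaries, then join the alternatives with "|".
pattern <- paste(paste0("\\b", keywords, "\\b"), collapse = "|")

counts <- str_count(df, pattern)
counts
# [1] 3 2 2
```

With the boundaries in place, "1400" and "159" are no longer counted inside "140000" and "159000".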

Answer 2

You can also add the word boundaries with sprintf:

number_of_keywords_df <- str_count(df, paste(sprintf("\\b%s\\b", keywords), collapse = '|'))
number_of_keywords_df

which yields

[1] 3 2 2
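As a possible variant (not from the answers above), the boundaries can be written once around a non-capturing group instead of around every keyword, and str_extract_all can be used to see which keywords were actually matched. This is only a sketch; grouped_pattern and matched are hypothetical names chosen for the illustration.

```r
library(stringr)

df <- c("honda civic 1988 with new lights",
        "toyota auris 4x4 140000 km",
        "nissan skyline 2.0 159000 km")
keywords <- c("honda", "civic", "toyota", "auris", "nissan", "skyline",
              "1988", "1400", "159")

# One pair of boundaries around the whole alternation: \b(?:a|b|c)\b
grouped_pattern <- paste0("\\b(?:", paste(keywords, collapse = "|"), ")\\b")

str_count(df, grouped_pattern)
# [1] 3 2 2

# List the whole-word matches per sentence to verify what was counted.
matched <- str_extract_all(df, grouped_pattern)
matched[[2]]
# [1] "toyota" "auris"
```

Both forms assume the keywords contain no regex metacharacters; keywords such as "2.0" would need escaping before being pasted into the pattern.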
