Finding abbreviations in data using R

Question

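My data (text) contains abbreviations.

Is there a function or code that searches text for abbreviations? For example, one that detects abbreviations of 3, 4, or 5 capital letters and lets me count how often they occur.

Many thanks!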

Answer 1

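You can use the regular expression [A-Z] to match any uppercase letter. If you want this pattern to repeat exactly 3 times, append {3} to the regular expression. Consider using a variable and a loop to cover 3 to 5 repetitions.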
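A minimal sketch of that loop idea, in case it helps. The sample text and the names `txt` and `matches_by_length` are made up for illustration. The word boundaries (`\\b`) go beyond what this answer describes; I am assuming you want them so that `[A-Z]{3}` does not match inside a longer run of capitals.

```r
# Illustrative sample text (hypothetical, not from the question)
txt <- "WHO and NASA met UNHCR, but ABCDEFGH is too long."

# Build one pattern per repetition count and collect the matches
lengths_wanted <- 3:5
matches_by_length <- lapply(lengths_wanted, function(n) {
  # Backslashes must be doubled inside an R string literal
  pattern <- sprintf("\\b[A-Z]{%d}\\b", n)
  regmatches(txt, gregexpr(pattern, txt))[[1]]
})
names(matches_by_length) <- lengths_wanted

print(matches_by_length)
```

Each list element holds the abbreviations of that exact length, so `lengths(matches_by_length)` gives the count per length.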

Answer 2

detecting 3-4-5 capital letter abbreviations

You can use

\b[A-Z]{3,5}\b

See the regex demo.

Details:

\b - a word boundary
[A-Z]{3,5} - 3, 4, or 5 uppercase letters (use [[:upper:]] instead to also match uppercase letters outside ASCII)
\b - a word boundary.

R demo online (leveraging the regex occurrence count code from @TheComeOnMan)

# Backslashes must be doubled in R string literals; "\b" alone is a backspace
abbrev_regex <- "\\b[A-Z]{3,5}\\b"
x <- "XYZ was seen at WXYZ with VWXYZ and did ABCDEFGH."
sum(gregexpr(abbrev_regex,x)[[1]] > 0)
## => [1] 3
regmatches(x, gregexpr(abbrev_regex, x))[[1]]
## => [1] "XYZ"   "WXYZ"  "VWXYZ"
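The question also asks how often each abbreviation occurs. One possible way to get per-abbreviation counts, as a sketch: pool the matches from all strings and tabulate them with `table()`. The input vector `docs` and the names `all_matches` and `freq` are hypothetical.

```r
# Hypothetical input: a character vector with one text per element
docs <- c("WHO and NASA met WHO staff.", "NASA briefed UNHCR and WHO.")

abbrev_regex <- "\\b[A-Z]{3,5}\\b"

# gregexpr() returns one match object per string; unlist() pools all matches
all_matches <- unlist(regmatches(docs, gregexpr(abbrev_regex, docs)))

# Frequency of each distinct abbreviation, most frequent first
freq <- sort(table(all_matches), decreasing = TRUE)
print(freq)
```

If the text lives in a data frame column, passing that column in place of `docs` should work the same way, provided it is a character vector.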
