Extract alphanumeric words and words with more than 1 uppercase using R

Question

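I am new to R programming and want to extract alphanumeric words AND words that contain more than 1 uppercase letter.

Below is an example of the strings and my desired output.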

    x <- c("123AB123 Electrical CDe FG123-4 ...", 
           "12/1/17 ABCD How are you today A123B", 
           "20.9.12 Eat / Drink XY1234 for PQRS1",
           "Going home H123a1 ab-cd1",
           "Change channel for al1234 to al5678")

    #Desired Output
    #[1] "123AB123 CDe FG123-4"  "ABCD A123B"  "XY1234 PQRS"  
    #[2] "H123a1 ab-cd1"  "al1234 al5678"

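So far I have come across 2 different solutions on Stack Overflow:

Extract all words that contain a number --> this does not help me, because the column I am applying the function to contains many date strings, e.g. "12/1/17 ABCD How are you today A123B"
Identify strings with more than one caps/uppercase --> Pierre Lafortune has provided the following solution: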

    library(stringr)
    str_count(x, "\\b[A-Z]{2,}\\b")

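His code returns the number of times a string has a word with more than 1 uppercase letter, but I want to extract those words as well, in addition to extracting the alphanumeric words.

Forgive me if my question or research is not comprehensive enough. I will post the solution I found for extracting all words that contain a number within 12 hours, once I have access to a workstation with R and the dataset.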

Answer 1

This works:

    library(stringr)

    # split words from strings into one-word-per element vector
    y <- unlist(str_split(x, ' '))

    # find strings with at least 2 uppercase
    uppers <- str_count(y, '[A-Z]')>1

    # find strings with at least 1 letter
    alphas <- str_detect(y, '[:alpha:]')

    # find strings with at least 1 number
    nums <- str_detect(y, '[:digit:]')

    # subset vector to those that have 2 uppercase OR a letter AND a number
    y[uppers | (alphas & nums)]

     [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
     [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"
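
The result above is one flat vector, while the question asks for one joined string per input element. A minimal sketch of the same word-level test applied per string could look like the following. It assumes base R functions in place of stringr, and `keep_words` is a hypothetical helper name used only for illustration.

```r
# Sample input from the question
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# keep_words() is a hypothetical helper, not part of any package
keep_words <- function(s) {
  # split a single string into words on spaces
  w <- strsplit(s, " ", fixed = TRUE)[[1]]
  # number of uppercase ASCII letters in each word
  n_upper <- lengths(regmatches(w, gregexpr("[A-Z]", w, perl = TRUE)))
  # words with at least 1 letter / at least 1 digit
  has_alpha <- grepl("[[:alpha:]]", w)
  has_digit <- grepl("[[:digit:]]", w)
  # keep words with 2+ uppercase OR a letter AND a digit, then rejoin
  paste(w[n_upper > 1 | (has_alpha & has_digit)], collapse = " ")
}

# apply the helper to every element, keeping one result per input string
result <- vapply(x, keep_words, character(1), USE.NAMES = FALSE)
print(result)
```

Because the filtering happens inside the helper, each input string keeps its own set of words instead of being merged into a single vector.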

Answer 2

A single regex solution will also work:

    > res <- str_extract_all(x, "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)")
    > unlist(res)
     [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
     [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"

This also works with regmatches in base R using the PCRE regex engine:

    > res2 <- regmatches(x, gregexpr("(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)", x, perl=TRUE))
    > unlist(res2)
     [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
     [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"

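Why does it work?

(?<!\S) - matches a position that is right after whitespace or at the start of the string
(?: - start of a non-capturing group that defines two alternative patterns:
- (?=\S*\p{L})(?=\S*\d)\S+
  - (?=\S*\p{L}) - makes sure there is a letter after 0+ non-whitespace characters (for better performance, replace \S* with [^\s\p{L}]*)
  - (?=\S*\d) - makes sure there is a digit after 0+ non-whitespace characters (for better performance, replace \S* with [^\s\d]*)
  - \S+ - matches 1 or more non-whitespace characters
- | - or
- (?:\S*\p{Lu}){2}\S*:
  - (?:\S*\p{Lu}){2} - 2 occurrences of 0+ non-whitespace characters (\S*, for better performance, replace with [^\s\p{Lu}]*) followed by 1 uppercase letter (\p{Lu})
  - \S* - 0+ non-whitespace characters
) - end of the non-capturing group.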
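
The breakdown above suggests swapping each \S* inside the lookaheads and the repeated group for a negated character class. A sketch of that variant, written out in full and compared against the original pattern on the question's sample data, might look like this. The variable names `orig_pattern` and `fast_pattern` are illustrative only, and no timing is measured here.

```r
# Sample input from the question
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# pattern from the answer (backslashes doubled inside the R string)
orig_pattern <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"

# variant with the suggested negated character classes in place of \S*
fast_pattern <- paste0(
  "(?<!\\S)(?:",
  "(?=[^\\s\\p{L}]*\\p{L})",       # a letter somewhere in the word
  "(?=[^\\s\\d]*\\d)",             # a digit somewhere in the word
  "\\S+",
  "|",
  "(?:[^\\s\\p{Lu}]*\\p{Lu}){2}",  # two uppercase letters in the word
  "\\S*",
  ")"
)

# extract with both patterns using the PCRE engine
m_orig <- unlist(regmatches(x, gregexpr(orig_pattern, x, perl = TRUE)))
m_fast <- unlist(regmatches(x, gregexpr(fast_pattern, x, perl = TRUE)))

# both patterns should select the same words
print(identical(m_orig, m_fast))
print(m_fast)
```

Each negated class stops at the first character it is looking for, so the engine has less to backtrack over than with a greedy \S* that first runs to the end of the word.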

To join the matches that belong to each element of the character vector, you can use

    unlist(lapply(res, function(c) paste(unlist(c), collapse=" ")))

See the online R demo.

Output:

    [1] "123AB123 CDe FG123-4" "ABCD A123B"           "XY1234 PQRS1"        
    [4] "H123a1 ab-cd1"        "al1234 al5678"
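
When some input strings contain no qualifying word, it can help to see what the joining step returns for them. The sketch below is a variation on the joining code rather than the answer's exact code: it assumes base R regmatches for the extraction, uses vapply so that every element is guaranteed to be a single string, and `x2` is made-up sample data.

```r
# same pattern as in the answer (backslashes doubled inside the R string)
pattern <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"

# hypothetical input: the second string has no word that qualifies
x2 <- c("Going home H123a1 ab-cd1", "nothing to keep here")

# list with one character vector of matches per input string
res <- regmatches(x2, gregexpr(pattern, x2, perl = TRUE))

# collapse each vector of matches into one string; an empty match set becomes ""
joined <- vapply(res, paste, character(1), collapse = " ")
print(joined)
```

The result stays the same length as the input, which matters when it is written back into a data frame column.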
