Stringr pattern to detect capitalized words

I am trying to write a function that detects words written entirely in capital letters.

My code so far:

library(dplyr)    # %>%, mutate()
library(tibble)   # add_row()
library(stringr)  # str_extract_all()

df <- data.frame(title = character(), id = numeric()) %>%
        add_row(title= "THIS is an EXAMPLE where I DONT get the output i WAS hoping for", id = 6)

df <- df %>%
        mutate(sec_code_1 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][1]) 
               , sec_code_2 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][2]) 
               , sec_code_3 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][3]))
df

The output is:

title id sec_code_1 sec_code_2 sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for 6 DONT WAS

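The first uppercase word of 3 to 5 letters is "THIS"; the second should skip "EXAMPLE" (more than 5 letters) and be "DONT"; the third should be "WAS". That is, I expect: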

title id sec_code_1 sec_code_2 sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for 6 THIS DONT WAS

Does anyone know where I am going wrong? Specifically, how can I use stringr to express "space or start of string" and "space or end of string" in a pattern?

If you run your regular expression on its own, you will see that 'THIS' is not included in the output at all.

str_extract_all(df$title," [A-Z]{3,5} ")[[1]]
#[1] " DONT " " WAS " 

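This is because you are extracting words that have both a leading and a trailing space. 'THIS' has no leading space because it is at the start of the sentence, so it does not satisfy the pattern. You can use word boundaries instead (\b, written as "\\b" inside an R string).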

str_extract_all(df$title, "\\b[A-Z]{3,5}\\b")[[1]]
#[1] "THIS" "DONT" "WAS"

If you use this pattern in your code, it will work.
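The question asks literally for "space or start of string" and "space or end of string". A word boundary is close to that but not identical, since it also matches next to punctuation such as a hyphen. As a minimal sketch, assuming only whitespace should delimit a code, negative lookarounds express the stricter rule; the name `pattern_ws` is made up for this illustration.

```r
library(stringr)

title <- "THIS is an EXAMPLE where I DONT get the output i WAS hoping for"

# (?<!\\S): not preceded by a non-space character (a space or the start of the string)
# (?!\\S):  not followed by a non-space character (a space or the end of the string)
pattern_ws <- "(?<!\\S)[A-Z]{3,5}(?!\\S)"

str_extract_all(title, pattern_ws)[[1]]
#[1] "THIS" "DONT" "WAS"

# The two patterns differ when punctuation touches a word:
str_extract_all("ABC-DEF GHI", "\\b[A-Z]{3,5}\\b")[[1]]
#[1] "ABC" "DEF" "GHI"
str_extract_all("ABC-DEF GHI", pattern_ws)[[1]]
#[1] "GHI"
```

Because the lookarounds do not consume the surrounding spaces, the matches come back without padding, and two codes separated by a single space are both found.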

Alternatively, you can use:

library(tidyverse)

df %>%
  # list-column holding every 3-5 letter uppercase word per row
  mutate(sec_code = str_extract_all(title, "\\b[A-Z]{3,5}\\b")) %>%
  # names_sep numbers the unnamed matches: sec_code_1, sec_code_2, ...
  unnest_wider(sec_code, names_sep = "_")

# title                                     id sec_code_1 sec_code_2 sec_code_3
#  <chr>                                  <dbl> <chr>      <chr>      <chr>     
#1 THIS is an EXAMPLE where I DONT get t…     6 THIS       DONT       WAS
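Note that the `[[1]]` in the question's code always reads the matches of the first row, so it would give wrong results on a data frame with several rows. As a rough sketch of a row-wise alternative that avoids tidyr, `str_extract_all()` with `simplify = TRUE` returns a character matrix with one row per input string. The second title and the names `titles` and `m` are invented for this example.

```r
library(stringr)

titles <- c("THIS is an EXAMPLE where I DONT get the output i WAS hoping for",
            "only ONE code here")  # hypothetical second row

# simplify = TRUE gives a character matrix, padding short rows with ""
m <- str_extract_all(titles, "\\b[A-Z]{3,5}\\b", simplify = TRUE)

# turn the padding into missing values and label the columns
m[m == ""] <- NA_character_
colnames(m) <- paste0("sec_code_", seq_len(ncol(m)))
m
```

The resulting matrix has as many columns as the row with the most matches, and it can be bound to the original data frame with `cbind()`.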