Stringr pattern to detect capitalized words

Question

I am trying to write a function that detects words written entirely in capital letters.

Currently, my code is:

library(tidyverse)  # provides %>%, add_row(), mutate() and str_extract_all()

df <- data.frame(title = character(), id = numeric()) %>%
        add_row(title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for", id = 6)

df <- df %>%
        mutate(sec_code_1 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][1]) 
               , sec_code_2 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][2]) 
               , sec_code_3 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][3]))
df

The output is:

title	id	sec_code_1	sec_code_2	sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for	6	DONT	WAS

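The first uppercase word of 3 to 5 letters is "THIS"; the second match should skip "EXAMPLE" (more than 5 letters) and be "DONT"; the third should be "WAS". That is, I expected: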

title	id	sec_code_1	sec_code_2	sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for	6	THIS	DONT	WAS

Does anyone know where I am going wrong? Specifically, how can I use stringr to express "space or start of string" and "space or end of string" in a pattern?

Answer 1

If you run the regular expression from your code on its own, you will see that 'THIS' is not in the output at all.

str_extract_all(df$title," [A-Z]{3,5} ")[[1]]
#[1] " DONT " " WAS "

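This is because you are extracting words that have both a leading and a trailing space. 'THIS' has no leading space because it is at the start of the sentence, so it does not satisfy the pattern. You can use word boundaries instead (\b, written as "\\b" in an R string because the backslash has to be escaped).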

str_extract_all(df$title,"\\b[A-Z]{3,5}\\b")[[1]]
#[1] "THIS" "DONT" "WAS"
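If you want to spell out "space or start of string" and "space or end of string" literally, as the question asks, lookarounds are one option. This is only a sketch, and the variable names `title`, `pattern` and `codes` are mine. Unlike a word boundary, it treats only whitespace as a delimiter, so a word followed by punctuation (for example "WAS.") would not match.

```r
library(stringr)

title <- "THIS is an EXAMPLE where I DONT get the output i WAS hoping for"

# (?<!\\S): not preceded by a non-space character (start of string or whitespace)
# (?!\\S) : not followed by a non-space character (end of string or whitespace)
pattern <- "(?<!\\S)[A-Z]{3,5}(?!\\S)"

codes <- str_extract_all(title, pattern)[[1]]
codes
#[1] "THIS" "DONT" "WAS"
```

Because lookarounds do not consume characters, the matches come back without the surrounding spaces.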

If you use the word-boundary pattern in your code, it will work.

Alternatively, you can use:
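For reference, here is roughly what the question's code could look like with the word-boundary pattern. This is a sketch, not the only way to do it: I extract once with `simplify = TRUE` instead of calling `str_extract_all()` three times, and I assume at least one title has three or more matches (otherwise the matrix has fewer than three columns and the indexing fails).

```r
library(dplyr)
library(stringr)

df <- tibble(
  title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for",
  id = 6
)

# simplify = TRUE returns a character matrix: one row per title,
# one column per match, padded with "" where a title has fewer matches
codes <- str_extract_all(df$title, "\\b[A-Z]{3,5}\\b", simplify = TRUE)

df <- df %>%
  mutate(
    sec_code_1 = codes[, 1],
    sec_code_2 = codes[, 2],
    sec_code_3 = codes[, 3]
  )
df
```

Extracting outside of `mutate()` also avoids the `[[1]]` indexing, which would silently reuse the first row's matches for every row of a longer data frame.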

library(tidyverse)

df %>%
  mutate(code = str_extract_all(title,"\\b[A-Z]{3,5}\\b")) %>%
  unnest_wider(code) %>%
  rename_with(~paste0('sec_code_', seq_along(.)), starts_with('..'))

# title                                     id sec_code_1 sec_code_2 sec_code_3
#  <chr>                                  <dbl> <chr>      <chr>      <chr>     
#1 THIS is an EXAMPLE where I DONT get t…     6 THIS       DONT       WAS
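The automatic `...1`, `...2` names that the rename step relies on depend on the tidyr version; as far as I know, recent releases ask for `names_sep` when the unnested values are unnamed. A variant that should not need the rename step, assuming the list column is called `sec_code` (my naming, not part of the answer above), might look like this:

```r
library(dplyr)
library(stringr)
library(tidyr)

df <- tibble(
  title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for",
  id = 6
)

result <- df %>%
  mutate(sec_code = str_extract_all(title, "\\b[A-Z]{3,5}\\b")) %>%
  # names_sep pastes the column name and the match index together:
  # sec_code_1, sec_code_2, ...
  unnest_wider(sec_code, names_sep = "_")

names(result)
```

Titles with fewer matches than the widest row get NA in the extra columns.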

r

stringr