Stringr pattern to detect capitalized words
I am trying to write a function that detects words written entirely in capital letters.
Currently, the code is:
library(tidyverse)  # provides %>%, add_row(), mutate() and the stringr functions

df <- data.frame(title = character(), id = numeric()) %>%
  add_row(title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for", id = 6)
df <- df %>%
  mutate(sec_code_1 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][1]),
         sec_code_2 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][2]),
         sec_code_3 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][3]))
df
The output is:
title | id | sec_code_1 | sec_code_2 | sec_code_3 |
---|---|---|---|---|
THIS is an EXAMPLE where I DONT get the output i WAS hoping for | 6 | DONT | WAS | |
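The first all-caps word of 3-5 letters is "THIS", the second should skip "EXAMPLE" (more than 5 letters) and be "DONT", and the third should be "WAS". That is:
title | id | sec_code_1 | sec_code_2 | sec_code_3 |
---|---|---|---|---|
THIS is an EXAMPLE where I DONT get the output i WAS hoping for | 6 | THIS | DONT | WAS |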
Does anyone know where I went wrong? Specifically, how can I express "space or start of string" and "space or end of string" in a stringr pattern?
If you run the code with your regular expression, you will see that 'THIS' is not included in the output at all.
str_extract_all(df$title," [A-Z]{3,5} ")[[1]]
#[1] " DONT " " WAS "
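This is because the pattern extracts words together with a leading and a trailing space. 'THIS' has no leading space because it is at the start of the sentence, so it does not match the pattern. You can use word boundaries instead (\b, written as "\\b" inside an R string).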
str_extract_all(df$title, "\\b[A-Z]{3,5}\\b")[[1]]
#[1] "THIS" "DONT" "WAS"
If you use the word-boundary pattern "\\b[A-Z]{3,5}\\b" in your code, it will work.
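The question asks specifically for "space or start of string" and "space or end of string". A minimal sketch of one way to write that literally is with negative lookarounds, assuming stringr's default ICU regex engine. This is a swapped-in alternative to the word-boundary pattern, not part of the original answer.

```r
library(stringr)

title <- "THIS is an EXAMPLE where I DONT get the output i WAS hoping for"

# (?<!\\S) : not preceded by a non-space character (start of string or whitespace)
# (?!\\S)  : not followed by a non-space character (end of string or whitespace)
codes <- str_extract_all(title, "(?<!\\S)[A-Z]{3,5}(?!\\S)")[[1]]
print(codes)
```

The lookarounds do not consume the surrounding spaces, so the matches come back without padding. Unlike "\\b", this variant should also reject capitalized fragments that touch punctuation (for example the parts of a hyphenated token), which may or may not be what you want.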
Alternatively, you can extract all matches at once and spread them into columns:
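library(tidyverse)
df %>%
  # start from the original columns so the new sec_code_* names do not clash
  select(title, id) %>%
  mutate(sec_code = str_extract_all(title, "\\b[A-Z]{3,5}\\b")) %>%
  # names_sep builds sec_code_1, sec_code_2, ...; recent tidyr versions
  # appear to require it when the unnested vectors are unnamed
  unnest_wider(sec_code, names_sep = "_")
#  title                                     id sec_code_1 sec_code_2 sec_code_3
#  <chr>                                  <dbl> <chr>      <chr>      <chr>
#1 THIS is an EXAMPLE where I DONT get t…     6 THIS       DONT       WAS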
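Since the goal is a reusable function, here is a rough sketch of how the extraction might be wrapped so that it works on several rows at once (the `[[1]]` indexing in the question only ever looks at the first title). The helper name `extract_sec_codes`, the second title, and the id `7` are made up for illustration.

```r
library(stringr)

# Hypothetical helper: first `n` all-caps words of 3-5 letters per title
extract_sec_codes <- function(titles, n = 3) {
  matches <- str_extract_all(titles, "\\b[A-Z]{3,5}\\b")
  # Indexing past the end of a character vector yields NA, which pads short rows
  padded <- lapply(matches, function(x) x[seq_len(n)])
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- paste0("sec_code_", seq_len(n))
  out
}

titles <- c(
  "THIS is an EXAMPLE where I DONT get the output i WAS hoping for",
  "only ONE code here"  # made-up second row with a single match
)
res <- cbind(data.frame(title = titles, id = c(6, 7)), extract_sec_codes(titles))
print(res)
```

Rows with fewer than `n` matches are filled with NA, so every title produces the same set of columns.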