Extract uppercase words till the first lowercase letter

Question

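I need to extract the first part of a text: the run of uppercase words up to the first lowercase letter.

For example, I have the following text: "IV LONG TEXT HERE and now the Text End HERE"

I want to extract "IV LONG TEXT HERE".

I have been trying something like this: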

text <- "IV LONG TEXT HERE and now the Text End HERE"

stringr::str_extract_all(text, "[A-Z]")

But my regex fails.

Answer 1

Instead of str_extract, use str_replace or str_remove.

library(stringr)
# Match one or more whitespace characters (\\s+) followed by
# one or more lowercase letters ([a-z]+) and the rest of the string (.*),
# then remove the matched characters
str_remove(text, "\\s+[a-z]+.*")
[1] "IV LONG TEXT HERE"
# Or match one or more uppercase letters or spaces ([A-Z ]+) as a
# capture group `()`, followed by one or more whitespace characters (\\s+)
# and the rest of the string (.*); replace the match with the
# backreference (\\1) to the captured group
str_replace(text, "([A-Z ]+)\\s+.*", "\\1")
[1] "IV LONG TEXT HERE"
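For readers without stringr installed, the same two ideas can be sketched in base R. This is an illustration rather than part of the original answer: it assumes `sub()` with `perl = TRUE` as a stand-in for `str_remove()` and `str_replace()`, and the names `res_remove` and `res_replace` are made up for the example.

```r
text <- "IV LONG TEXT HERE and now the Text End HERE"

# Stand-in for str_remove(): delete everything from the first
# whitespace + lowercase word to the end of the string.
res_remove <- sub("\\s+[a-z]+.*", "", text, perl = TRUE)

# Stand-in for str_replace(): capture the leading run of uppercase
# letters and spaces, then keep only that captured group (\\1).
res_replace <- sub("([A-Z ]+)\\s+.*", "\\1", text, perl = TRUE)

print(res_remove)   # "IV LONG TEXT HERE"
print(res_replace)  # "IV LONG TEXT HERE"
```

Note that the backslashes are doubled inside the R string literals; a single `\s` in an R string is rejected as an unrecognized escape.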

Answer 2

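You can use str_extract with a pattern that matches a single uppercase character, optionally followed by uppercase characters or spaces and ending in another uppercase character.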

\b[A-Z](?:[A-Z ]*[A-Z])?\b

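Explanation

\b[A-Z] A word boundary to prevent a partial-word match, then a single character A-Z
(?: Start of a non-capturing group, matched as a whole
- [A-Z ]*[A-Z] Match zero or more characters A-Z or spaces, then a single character A-Z
)? Close the non-capturing group and make it optional
\b A word boundary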

Example

text <- "IV LONG TEXT HERE and now the Text End HERE"

stringr::str_extract(text, "\\b[A-Z](?:[A-Z ]*[A-Z])?\\b")

Output

[1] "IV LONG TEXT HERE"
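To see what the word boundaries buy you, here is a minimal base R sketch of the same pattern applied to a few inputs. The helper name `extract_upper` and the extra test strings are hypothetical, and `regexpr()` with `perl = TRUE` is used as an assumed substitute for `str_extract()`.

```r
# Hypothetical helper: returns the first run of uppercase words in each
# element of x, or NA where the pattern does not match at all.
extract_upper <- function(x) {
  pattern <- "\\b[A-Z](?:[A-Z ]*[A-Z])?\\b"
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  # regmatches() returns only the successful matches, so assign them
  # back to the positions where a match was found.
  out[m != -1] <- regmatches(x, m)
  out
}

inputs <- c(
  "IV LONG TEXT HERE and now the Text End HERE",
  "IV LONG Text here",   # "Text" is only partly uppercase
  "no capitals here"     # nothing to match
)

print(extract_upper(inputs))
# "IV LONG TEXT HERE" "IV LONG" NA
```

In the second input the trailing `\b` stops the match before "Text", because there is no word boundary between "T" and "e". The pattern is not anchored to the start of the string, so it returns the first uppercase run wherever it occurs.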

Answer 3

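The following code sample should work.

text <- "IV LONG TEXT HERE and now the Text End HERE"

stringr::str_extract_all(text, "\\w.*[A-Z] \\b")

Output:

[1] 'IV LONG TEXT HERE '

Explanation:

Matches a word character (\w), followed by zero or more occurrences of any character (.*), up to an uppercase letter ([A-Z]) that is followed by a space and a word boundary ( \b). Note that the match includes the trailing space.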
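Because `.*` is greedy, this pattern runs to the last place in the string where an uppercase letter is followed by a space and another word. The sketch below, in base R with `regexpr()` and `perl = TRUE` as an assumed substitute for stringr, shows the trailing space being trimmed and one input where the match overshoots. The second string and the helper name `extract_greedy` are made up for illustration.

```r
# Hypothetical helper: first match of the Answer 3 pattern, with
# surrounding whitespace trimmed.
extract_greedy <- function(x) {
  m <- regexpr("\\w.*[A-Z] \\b", x, perl = TRUE)
  trimws(regmatches(x, m))
}

text <- "IV LONG TEXT HERE and now the Text End HERE"
print(extract_greedy(text))
# "IV LONG TEXT HERE"

# A later standalone capital followed by a space extends the match
# past the lowercase words.
text2 <- "IV LONG and A b"
print(extract_greedy(text2))
# "IV LONG and A"
```

If the input can contain isolated capitals later in the sentence, one of the patterns from the other answers is likely the safer choice.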

regex

r