Extract the capitalized letter sequence in a column and replace the column with the newly truncated string with str_extract in R

Question

I have the following character vector, which contains parentheses, periods, and unnecessary descriptive words:

strings <- c("Poorly Graded Silty Sand (SP-SM).", "(Visual) Lean Clay (CL), with some sand.","Poorly Graded Silty Sand (SP-SM).","(Visual) Inorganic Silt (ML).","(Visual) Lean Clay (CL), with some sand.")

I want to extract only the letter code inside the parentheses of each element (for example, ML or SP-SM). This is the desired vector:

need <- c("SP-SM", "CL","SP-SM","ML","CL")

Is this possible?

Answer 1

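We can use str_extract with regex lookarounds to match one or more uppercase letters or hyphens that are preceded by an opening parenthesis and followed by a closing parenthesis: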

library(stringr)
str_extract(strings, "(?<=\\()[A-Z-]+(?=\\))")
[1] "SP-SM" "CL"    "SP-SM" "ML"    "CL"
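Since the title asks about replacing a column, here is a minimal sketch of the same call applied to a data frame. The data frame `df` and its column name `description` are assumptions made for illustration; they do not appear in the question.

```r
library(stringr)

# Hypothetical data frame; `df` and `description` are assumed names
df <- data.frame(
  description = c("Poorly Graded Silty Sand (SP-SM).",
                  "(Visual) Lean Clay (CL), with some sand."),
  stringsAsFactors = FALSE
)

# Overwrite the column with the code found between the parentheses.
# "(Visual)" is skipped because "V" is not directly followed by ")".
df$description <- str_extract(df$description, "(?<=\\()[A-Z-]+(?=\\))")
df$description
# [1] "SP-SM" "CL"
```

Rows without a parenthesized uppercase code would come back as NA, because str_extract returns NA when nothing matches.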

Answer 2

This is a longer version of akrun's solution:

str_extract(strings, '\\b[A-Z]{2}\\b\\-\\b[A-Z]{2}\\b|\\b[A-Z]{2}\\b')

Output:

[1] "SP-SM" "CL"    "SP-SM" "ML"    "CL"

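Explanation:

[A-Z]{2} matches exactly two uppercase letters.

\- matches a hyphen.

\b matches a word boundary, that is, the position between a word character and a non-word character.

| defines OR (alternation).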
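One caveat, shown as a sketch: the word-boundary pattern does not look at the parentheses, so it returns the first standalone two-letter uppercase word wherever it occurs. The input string below is hypothetical and is not part of the question's data.

```r
library(stringr)

# Hypothetical description with a two-letter uppercase word before the code
x <- "Sample from TP 3, Lean Clay (CL)."

# Word-boundary pattern: picks up the first standalone two-letter uppercase word
by_boundary <- str_extract(x, "\\b[A-Z]{2}\\b\\-\\b[A-Z]{2}\\b|\\b[A-Z]{2}\\b")
by_boundary
# [1] "TP"

# Lookaround pattern: only matches text enclosed in parentheses
by_parens <- str_extract(x, "(?<=\\()[A-Z-]+(?=\\))")
by_parens
# [1] "CL"
```

For the strings in the question both patterns give the same result; the difference only matters if other uppercase abbreviations can appear in the text.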

Answer 3

Here is another option in base R:

unlist(regmatches(strings, gregexpr("(?<=\\()[[:upper:]]{1,}(-[[:upper:]]{1,})?(?=\\))", strings, perl = TRUE)))

[1] "SP-SM" "CL"    "SP-SM" "ML"    "CL"

Note that I made the second substring optional with ?, because it may be absent: (-[[:upper:]]{1,})?

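(?<=\() is a positive lookbehind. It matches any string preceded by an opening parenthesis (.

(?=\)) is a positive lookahead. It matches any string followed by a closing parenthesis ).

[[:upper:]]{1,} matches one or more uppercase letters.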
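A possible variation, offered as a sketch: unlist() over gregexpr() silently drops elements that have no match, so the result can end up shorter than the input. Using regexpr() and filling a preallocated vector keeps one result per input element. The second input string below is hypothetical, added only to show the no-match case.

```r
# Sample input; the middle element is a hypothetical string with no code
strings <- c("Poorly Graded Silty Sand (SP-SM).",
             "No code here",
             "(Visual) Inorganic Silt (ML).")

pattern <- "(?<=\\()[[:upper:]]+(-[[:upper:]]+)?(?=\\))"

# regexpr() gives the first match per element, or -1 when there is none
m <- regexpr(pattern, strings, perl = TRUE)

# Start with NA everywhere, then fill in only the elements that matched
codes <- rep(NA_character_, length(strings))
codes[m != -1] <- regmatches(strings, m)
codes
# [1] "SP-SM" NA      "ML"
```

This keeps the output aligned with the input, which matters when the result is assigned back to a data frame column.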


regex

r

stringr