Using vector in a regex to extract substrings with only known start & end

Question

How can I use a vector in a regex so that I extract everything from any entry in the vector up to another word?

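I need to use str_match to extract multiple substrings from a series of large strings in a data frame. Each substring starts with a tree species and ends with the word "links". Because the substrings I need can start with many different species, I created a vector named tree.sp to hold all the possibilities.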

test.df <- data.frame(
  Heading_ChLk = c("West", 40.00, 80.00),
  Bound_Desc = c("On the Base line along the south side of section 34 T 1 N, R 29 W of the 5th PM.",
                 "Set a 1/4 section corner post from which a pine 9 inches diameter bears N 43 E 35 links and a black oak 15 inches diameter bears S 10 E 30 links",
                 "Set a post corner to sections 33 & 34 from which a white oak 17 inches diameter bears N 32 W 57 links and a black oak 20 inches diameter bears N 46 E 19 links.")
)

tree.sp <- c("pine|black oak|white oak")

Answer 1

You can use str_match as follows:

library(stringr)

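# sprintf() drops the species alternation into the pattern;
# column 2 of the match matrix holds the outer capture group
test.df$result <- str_match(test.df$Bound_Desc, sprintf('((%s).*links)', tree.sp))[, 2]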
test.df$result

#[1] NA                                                                                                           
#[2] "pine 9 inches diameter bears N 43 E 35 links and a black oak 15 inches diameter bears S 10 E 30 links"      
#[3] "white oak 17 inches diameter bears N 32 W 57 links and a black oak 20 inches diameter bears N 46 E 19 links"
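
In the question, tree.sp is a single string that already contains the "|" separators. If the species were instead stored one per element, the same pattern could be built by collapsing the vector first. The following is a minimal sketch under that assumption; tree.vec, tree.alt, desc and found are hypothetical names used only for illustration, and species names containing regex metacharacters would need escaping first.

```r
library(stringr)

# Hypothetical character vector with one species per element
tree.vec <- c("pine", "black oak", "white oak")

# Collapse it into a single alternation: "pine|black oak|white oak"
tree.alt <- paste(tree.vec, collapse = "|")

# A species name, then anything, up to the last "links" in the string
pattern <- sprintf("(%s).*links", tree.alt)

# Small illustrative input: one string without a tree, one with
desc <- c("No trees here.",
          "a pine 9 inches diameter bears N 43 E 35 links")

found <- str_extract(desc, pattern)
found
```

Strings that contain no species name followed by "links" come back as NA, as in the first row of the result above.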

Similar code also works with str_extract:

str_extract(test.df$Bound_Desc, sprintf('(%s).*links', tree.sp))
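
Because ".*" is greedy, both patterns return one long substring that runs from the first species to the last "links" in each row. If each tree should come back as its own substring, one possible variation (not part of the original answer) is a lazy quantifier combined with str_extract_all. This sketch assumes every tree record ends at the first "links" that follows its species name; desc and pieces are hypothetical names.

```r
library(stringr)

tree.sp <- "pine|black oak|white oak"

# One description taken from the question's data
desc <- "Set a 1/4 section corner post from which a pine 9 inches diameter bears N 43 E 35 links and a black oak 15 inches diameter bears S 10 E 30 links"

# ".*?" is lazy, so each match stops at the first "links" after a species name
pieces <- str_extract_all(desc, sprintf("(%s).*?links", tree.sp))[[1]]
pieces
#[1] "pine 9 inches diameter bears N 43 E 35 links"
#[2] "black oak 15 inches diameter bears S 10 E 30 links"
```

str_extract_all returns a list with one character vector per input string, so applying it to the whole Bound_Desc column gives a list rather than a plain column.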

regex

substring

r

stringr