Extract a specified number of words after a string in R

Question

In the example below, I am trying to extract the 4 words that follow the string "source:".

library(stringr)

x <- data.frame(end = c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))


x$source = str_extract(x$end, '[^source: ](.)*')

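When I run the code above, I can extract all of the text after "source:" into a new column. Is there a way, using stringr or any other package, to extract only the first 4 words after "source:"?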

Answer 1

You can use:

trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4}'))
#[1] "from animal origin as"       "Eggs, liver, certain fish"   
#    "Leafy green vegetables such"

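(?<=source:\\s) is a positive lookbehind: the match must be immediately preceded by 'source:' and a whitespace character, but that prefix is not included in the result.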
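
To see what the lookbehind changes, here is a small sketch comparing the same pattern with and without it. The sample string s and the variable names are made up for illustration and are not part of the original answer.

```r
library(stringr)

# Illustrative input, shortened from the question's data.
s <- "source: Leafy green vegetables such as spinach"

# Without a lookbehind, the prefix is part of the returned match.
with_prefix <- str_extract(s, "source:\\s\\w+")

# With a lookbehind, the prefix must be present but is not returned.
without_prefix <- str_extract(s, "(?<=source:\\s)\\w+")

print(with_prefix)     # "source: Leafy"
print(without_prefix)  # "Leafy"
```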

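After it, we capture 4 "words", each made up of word characters, an optional comma, and a trailing whitespace character. trimws() then removes the whitespace left at the end.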
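
As a possible alternative (not part of the original answer), one could strip the prefix and then take words by position with stringr::word(), which splits on single spaces by default. This is only a sketch: the vector end below holds shortened copies of the question's strings, and first_four is a name chosen for illustration.

```r
library(stringr)

# Shortened copies of the strings from the question.
end <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol",
  "source: Eggs, liver, certain fish species such as sardines",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
)

# Drop the leading "source:" and any whitespace after it,
# then keep words 1 through 4 (space-separated).
first_four <- word(str_remove(end, "^source:\\s*"), 1, 4)

print(first_four)
```

Because word() splits on spaces rather than matching word characters, punctuation attached to a word (such as the commas in the second string) stays in the result.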

text-extraction

r

stringr