Extract specified number of words after a string in R
In the example below, I am trying to extract the 4 words that follow the string 'source:'.
library(stringr)
x <- data.frame(end = c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))
x$source = str_extract(x$end, '[^source: ](.)*')
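When I run the code above, all of the text after 'source:' is extracted into a new column. Is there a way, using stringr or any other package, to extract only the first 4 words after 'source:'?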
You can use:
trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4}'))
#[1] "from animal origin as" "Eggs, liver, certain fish"
# "Leafy green vegetables such"
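(?<=source:\\s) is a positive lookbehind: it requires 'source:' followed by a whitespace character to appear immediately before the match, without including it in the result.
After it, we capture 4 "words", each followed by an optional comma and a whitespace character. trimws() then removes the trailing whitespace left by the last word.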
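The regex above returns NA when fewer than 4 words follow 'source:', or when one of the first 4 words is followed by punctuation other than a comma. As a rough alternative sketch, the same idea can be written in base R by splitting on whitespace instead. The helper name first_n_words_after and its arguments are hypothetical, not part of stringr or of the original answer:

```r
# Hypothetical helper (an illustration, not from the original answer):
# return the first `n` whitespace-separated tokens that follow `marker`.
first_n_words_after <- function(text, marker = "source:", n = 4) {
  vapply(as.character(text), function(s) {
    if (is.na(s)) return(NA_character_)
    # Locate the marker literally (fixed = TRUE), not as a regex.
    pos <- regexpr(marker, s, fixed = TRUE)
    if (pos == -1) return(NA_character_)
    # Keep only the text after the marker.
    rest <- substring(s, pos + attr(pos, "match.length"))
    # Split on runs of whitespace and keep at most `n` tokens.
    words <- strsplit(trimws(rest), "\\s+")[[1]]
    paste(head(words, n), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

x <- c("source: from animal origin as Vitamin A",
       "source: Eggs, liver, certain fish species",
       "source: egg yolks")
first_n_words_after(x)
#[1] "from animal origin as"     "Eggs, liver, certain fish" "egg yolks"
```

Unlike the lookbehind version, this keeps whatever punctuation is attached to each token and returns the available words when there are fewer than n, so which one fits better depends on how strict the match should be.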