Extract words from text and create a vector from them

Suppose I have a txt file containing the following text:

Type: fruits
Title: retail
Date: 2015-11-10
Country: UK
Products:
  apple,
  passion fruit,
  mango
Documents: NDA
Export: 2.10

I read this file with the readLines function. Now I would like to get a vector like this:

x <- c("fruits", "apple", "passion fruit", "mango")

So I want to extract the word after "Type:" and all the words between "Products:" and "Documents:". How can I do this? Thanks!

Taken as is, the file looks close to YAML format, so you could read it with the package of the same name, for example:

library(yaml)
info <- read_yaml("your file.txt")
# strsplit - split on the commas
# unlist - convert the resulting list to a vector
# trimws - remove leading and trailing whitespace
out <- trimws(unlist(strsplit(info$Products, ",")))
# prepend the Type entry to get the vector asked for in the question
x <- c(info$Type, out)

The other entries are available under their own names as list elements of info, e.g. info$Type.

There may be a more elegant solution, but you could try the following. If you have a vector like this:

vec <- readLines("path/file.txt")

and the file contains the text you posted, you can try this:

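# collapse the remaining runs of three spaces into one
gsub("   "," ",
     # replace the first space with ", " (between Type and the first product)
     sub(" ",", ",
       # strip everything except the Type value and the Products entries
       gsub(".*Type:\\s*|Title.*Products:\\s*| Documents.*", "",
           # collapse all lines into a single string
           paste0(vec, collapse = " "))))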
[1] "fruits, apple, passion fruit, mango"

Here is the output of dput(vec), to make the code reproducible:

c("Type: fruits", "Title: retail", "Date: 2015-11-10", "Country: UK", 
"Products:", "  apple,", "  passion fruit,", "  mango", "Documents: NDA", 
"Export: 2.10")
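
The regex approach above returns a single comma-separated string, while the question asks for a character vector. As a minimal sketch in base R, assuming the file always has exactly one "Type:" line and that the "Products:" line comes before the "Documents:" line, you could index the lines directly. The names type, start, end and products are only illustrative:

```r
# Sample lines, as readLines() would return them for the file in the question
vec <- c("Type: fruits", "Title: retail", "Date: 2015-11-10", "Country: UK",
         "Products:", "  apple,", "  passion fruit,", "  mango",
         "Documents: NDA", "Export: 2.10")

# Value after "Type:" (assumes exactly one such line)
type <- trimws(sub("^Type:", "", vec[grepl("^Type:", vec)]))

# Positions of the two marker lines (assumes each occurs once)
start <- grep("^Products:", vec)
end   <- grep("^Documents:", vec)

# Lines strictly between the markers; empty if there are none
products <- vec[seq_len(end - start - 1) + start]

# Drop the trailing commas and the surrounding whitespace
products <- trimws(sub(",\\s*$", "", products))

x <- c(type, products)
x
```

This gives a character vector of length 4 holding fruits, apple, passion fruit and mango. Because it works line by line, it does not depend on the exact indentation of the product lines.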