Extract words from text and create a vector from them

Question

Suppose I have a txt file containing the following text:

Type: fruits
Title: retail
Date: 2015-11-10
Country: UK
Products:
  apple,
  passion fruit,
  mango
Documents: NDA
Export: 2.10

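I read this file with the readLines function. I would then like to get a vector like this:

x <- c("fruits", "apple", "passion fruit", "mango")

So I want to extract the word after "Type:" and all the words between "Products:" and "Documents:". How can I do this? Thanks!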

Answer 1

As it stands, the file looks close to YAML format, so you could read it with, for example, the package of the same name:

library(yaml)
info <- yaml::read_yaml("your file.txt")
# strsplit - split either side of the commas
# unlist - convert to vector
# trimws - remove trailing and leading white space
out <- trimws(unlist(strsplit(info$Products, ",")))

The other entries are available in info as named list elements, for example info$Type.
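To get the exact vector from the question, the Type entry can be combined with the split products. This is a minimal sketch, assuming the yaml package is installed. The text is parsed from a string with yaml.load only so that the example is self-contained, and txt, products and x are illustrative names:

```r
library(yaml)

# The file contents from the question, inlined so the example runs on its own
txt <- "Type: fruits
Title: retail
Date: 2015-11-10
Country: UK
Products:
  apple,
  passion fruit,
  mango
Documents: NDA
Export: 2.10"

info <- yaml.load(txt)

# Products should come back as one folded string, so split it on the commas
products <- trimws(unlist(strsplit(info$Products, ",")))

# Prepend the Type entry to get the vector asked for in the question
x <- c(info$Type, products)
x
```

With a real file, yaml::read_yaml on its path would take the place of yaml.load.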

Answer 2

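There may be a more elegant solution, but you could try the following. If you have a vector like this:

vec <- readLines("path/file.txt")

and the file contains the text you posted, you can try this: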

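# collapse the runs of three spaces left over from joining the lines
gsub("   ", " ",
     # turn the first space into ", " to separate the type from the products
     sub(" ", ", ",
         # strip everything except the Type value and the product list
         gsub(".*Type:\\s*|Title.*Products:\\s*| Documents.*", "",
              # join all lines into a single string
              paste0(vec, collapse = " "))))
[1] "fruits, apple, passion fruit, mango"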

Here is the output of dput(vec), to make the code reproducible:

c("Type: fruits", "Title: retail", "Date: 2015-11-10", "Country: UK", 
"Products:", "  apple,", "  passion fruit,", "  mango", "Documents: NDA", 
"Export: 2.10")
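
If a character vector with one element per word is needed rather than a single string, the lines can also be selected by position. The following is a base R sketch, not part of the original answer. It assumes that "Products:" and "Documents:" each start exactly one line, with at least one product line between them. The names type, start, end, products and x are illustrative:

```r
vec <- c("Type: fruits", "Title: retail", "Date: 2015-11-10", "Country: UK",
         "Products:", "  apple,", "  passion fruit,", "  mango",
         "Documents: NDA", "Export: 2.10")

# Value after "Type:"
type <- trimws(sub("^Type:", "", vec[grepl("^Type:", vec)]))

# Positions of the two marker lines
start <- grep("^Products:", vec)
end   <- grep("^Documents:", vec)

# Lines strictly between the markers, minus trailing commas and padding
products <- trimws(sub(",\\s*$", "", vec[(start + 1):(end - 1)]))

# x holds "fruits", "apple", "passion fruit", "mango"
x <- c(type, products)
x
```

Because each product stays on its own line, product names that contain spaces, such as "passion fruit", are kept intact.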


regex

text-processing

r

gsub

stringr