Extract words from text and create a vector from them
Suppose I have a txt file containing the following text:
Type: fruits
Title: retail
Date: 2015-11-10
Country: UK
Products:
 apple,
 passion fruit,
 mango
Documents: NDA
Export: 2.10
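I read this file with the readLines function. I would then like to get a vector like this:
x <- c("fruits", "apple", "passion fruit", "mango")
That is, I want to extract the word after "Type:" and all the words between "Products:" and "Documents:".
How can I do this? Thanks!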
As posted, the file looks close to YAML format, so you could read it with the package of the same name, for example:
library(yaml)
info <- yaml::read_yaml("your file.txt")
# strsplit - split either side of the commas
# unlist - convert to vector
# trimws - remove trailing and leading white space
out <- trimws(unlist(strsplit(info$Products, ",")))
The other entries are available as list elements of info under their field names, e.g. info$Type.
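To get the exact vector asked for, the Type entry can be put in front of the split products. A minimal sketch, assuming the yaml package is installed and that the lines under "Products:" are indented as in the original file (otherwise the YAML parser will likely reject them); the inline txt string and the names products and x are only for illustration:

```r
library(yaml)

# Inline copy of the file so the example is self-contained.
# The lines under "Products:" are indented by one space (an assumption
# about the real file), which lets YAML fold them into one value.
txt <- "Type: fruits
Title: retail
Date: 2015-11-10
Country: UK
Products:
 apple,
 passion fruit,
 mango
Documents: NDA
Export: 2.10
"

# yaml.load() parses a string; read_yaml() does the same for a file path
info <- yaml.load(txt)

# Split the folded Products value on commas and trim the padding
products <- trimws(unlist(strsplit(info$Products, ",")))

# Type first, then the products
x <- c(info$Type, products)
print(x)
```

With a real file, yaml.load(txt) would simply be replaced by read_yaml() on the file path.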
There may be a more elegant solution, but you can try this. If you have a vector like this:
vec <- readLines("path/file.txt")
and the file contains the text you posted, you can try this:
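# collapse runs of spaces into a single space
gsub(" +", " ",
     # turn the first space (the one after the Type value) into ", "
     sub(" ", ", ",
         # strip everything except the Type value and the Products lines
         gsub(".*Type:\\s*|Title.*Products:\\s*| Documents.*", "",
              # collapse all lines into one string
              paste0(vec, collapse = " "))))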
[1] "fruits, apple, passion fruit, mango"
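Note that the result above is a single string, not yet the four-element vector from the question. One way to finish the job, as a small sketch (res and x are just illustrative names), is to split on the ", " separator:

```r
# The single string produced by the nested gsub()/sub() calls
res <- "fruits, apple, passion fruit, mango"

# Split on the literal ", " separator; strsplit() returns a list,
# so take its first (and only) element
x <- strsplit(res, ", ", fixed = TRUE)[[1]]
print(x)
# [1] "fruits"        "apple"         "passion fruit" "mango"
```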
Here is the output of dput(vec), to make the code reproducible:
c("Type: fruits", "Title: retail", "Date: 2015-11-10", "Country: UK",
"Products:", " apple,", " passion fruit,", " mango", "Documents: NDA",
"Export: 2.10")
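If you would rather not collapse everything into one string, you could also work on the lines directly. This is only a sketch in base R, assuming "Type:", "Products:" and "Documents:" each start exactly one line; the names type, start, end, idx, products and x are made up for the example:

```r
# Lines as returned by readLines() (same as the dput() output)
vec <- c("Type: fruits", "Title: retail", "Date: 2015-11-10", "Country: UK",
         "Products:", " apple,", " passion fruit,", " mango", "Documents: NDA",
         "Export: 2.10")

# Value on the "Type:" line, with the key and padding removed
type <- trimws(sub("^Type:", "", grep("^Type:", vec, value = TRUE)))

# Positions of the two marker lines
start <- grep("^Products:", vec)
end   <- grep("^Documents:", vec)

# Indices strictly between the markers (empty if nothing is in between)
idx <- seq(start + 1, length.out = end - start - 1)

# Drop the trailing commas and the surrounding whitespace
products <- trimws(sub(",\\s*$", "", vec[idx]))

x <- c(type, products)
print(x)
# [1] "fruits"        "apple"         "passion fruit" "mango"
```

This keeps each product as its own element from the start, so a product name containing several spaces does not need special handling.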