R: How to delete words other than specific words in a corpus

In the corpus "tkn_pb", I want to delete every word except a few keywords that I select (for example, "attack" and "gunman"). Is that possible?

You can subset your corpus using which() and grepl():

Data:

sample_tokens <- c("word", "another","a", "new", "word token", "one", "more", "and", "another one")

Remove all words except "a" and "and":

sample_tokens[which(grepl("\\b(a|and)\\b", sample_tokens))]
[1] "a"   "and"
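
One caveat: a word-boundary pattern also keeps multi-word tokens that merely contain a keyword. If only tokens that are exactly equal to a keyword should survive, matching with %in% is a possible alternative. The sketch below uses the keywords from the question; the vector names (keep, tokens, by_regex, by_exact) and the sample tokens are made up for illustration and are not part of the original data.

```r
# Hypothetical keyword list and token vector (illustrative only)
keep   <- c("attack", "gunman")
tokens <- c("the", "gunman", "began", "his", "attack", "attack plan")

# Regex with word boundaries; note the doubled backslash inside an R string.
# This also keeps "attack plan", because it contains the word "attack".
by_regex <- tokens[grepl("\\b(attack|gunman)\\b", tokens)]

# Exact matching: keeps only tokens that are equal to one of the keywords.
by_exact <- tokens[tokens %in% keep]

print(by_regex)
print(by_exact)
```

Which of the two is appropriate depends on whether the corpus contains multi-word tokens that should be kept.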

Edit:

If the corpus is a list, this solution suggested by @John will work:

Data:

sample_tokens <- list(c("word", "another","a", "new", "word token", "one", "more", "and", "another one"),
               c("yet", "a", "few", "more", "words"),
               c("and", "so on"))

lapply(sample_tokens, function(x) x[which(grepl("\\b(a|and)\\b", x))])
[[1]]
[1] "a"   "and"

[[2]]
[1] "a"

[[3]]
[1] "and"
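
The list case can also be wrapped in a small reusable helper. This is only a sketch: keep_only is a hypothetical name, not a function from any package, and it uses exact matching with %in% rather than a regular expression, so a token such as "so on" is dropped even though it would also fail the pattern above.

```r
# Hypothetical helper: keep only the tokens that exactly match a keyword,
# applied to each element (document) of a list of character vectors.
keep_only <- function(tokens_list, keep) {
  lapply(tokens_list, function(x) x[x %in% keep])
}

docs <- list(c("word", "another", "a", "new", "word token", "one", "more", "and", "another one"),
             c("yet", "a", "few", "more", "words"),
             c("and", "so on"))

result <- keep_only(docs, c("a", "and"))
print(result)
```

A document that contains none of the keywords comes back as an empty character vector, character(0), so the list keeps its original length.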