R: How to delete words other than specific words in a corpus

Question

In the corpus "tkn_pb", I want to delete every word except a few keywords that I choose (for example, "attack" and "gunman"). Is this possible?

Answer 1

You can subset your corpus using which and grepl:

Data:

sample_tokens <- c("word", "another","a", "new", "word token", "one", "more", "and", "another one")

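Delete all words except "a" and "and" (note the doubled backslashes: in an R string, "\\b" is the regex word boundary, whereas "\b" is a backspace character):

sample_tokens[which(grepl("\\b(a|and)\\b", sample_tokens))]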
[1] "a"   "and"
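
Note that the word-boundary pattern keeps any token that contains "a" or "and" as a whole word, so a multi-word token such as "a few" would also be kept. If only tokens that are exactly equal to a keyword should survive, a minimal sketch using %in% may be closer to what is wanted. The name keep_words is a hypothetical helper introduced here, not part of the original answer:

```r
# Hypothetical keyword vector; this name is an assumption for illustration.
keep_words <- c("a", "and")

sample_tokens <- c("word", "another", "a", "new", "word token",
                   "one", "more", "and", "another one")

# Keep only tokens that are exactly equal to one of the keywords.
kept <- sample_tokens[sample_tokens %in% keep_words]
print(kept)
# [1] "a"   "and"
```

Because %in% compares whole strings, no regular expression escaping is needed for the keywords.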

Edit:

If the corpus is a list, this solution suggested by @John will work:

Data:

sample_tokens <- list(c("word", "another","a", "new", "word token", "one", "more", "and", "another one"),
               c("yet", "a", "few", "more", "words"),
               c("and", "so on"))

lapply(sample_tokens, function(x) x[which(grepl("\\b(a|and)\\b", x))])
[[1]]
[1] "a"   "and"

[[2]]
[1] "a"

[[3]]
[1] "and"
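
Applied to the keywords from the question, the same idea might look like the sketch below. The structure of tkn_pb is not shown in the question, so tkn_pb_example is a made-up stand-in that assumes a list of character vectors; keywords and pattern are likewise illustrative names:

```r
# Hypothetical stand-in for the asker's corpus; the real tkn_pb is not shown.
tkn_pb_example <- list(
  c("the", "gunman", "fled", "after", "the", "attack"),
  c("no", "keywords", "here"),
  c("attack", "attacks")
)

# Keywords taken from the question.
keywords <- c("attack", "gunman")

# Build the pattern "\\b(attack|gunman)\\b" from the keyword vector.
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")

# Keep only the tokens in each element that match a keyword as a whole word.
filtered <- lapply(tkn_pb_example, function(x) x[grepl(pattern, x)])
print(filtered)
```

Elements with no matching token come back as character(0), and "attacks" is dropped because the word boundary requires the whole word to match. Keywords that contain regex metacharacters would need escaping before being pasted into the pattern.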

Tags: r, corpus, text-mining