R: How to delete all words except specific ones from a corpus
In the corpus "tkn_pb", I want to delete every word except a few keywords that I choose (for example "attack" and "gunman"). Is this possible?
You can subset your corpus with which and grepl:
Data:
sample_tokens <- c("word", "another","a", "new", "word token", "one", "more", "and", "another one")
Delete all words except "a" and "and":
# "\\b" is a word boundary; the backslash must be doubled inside an R string
sample_tokens[which(grepl("\\b(a|and)\\b", sample_tokens))]
[1] "a"   "and"
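If the goal is to keep only tokens that are exactly equal to one of the keywords, a minimal sketch with %in% may be simpler than a regular expression. The names keep and kept below are illustrative and not part of the original answer.

```r
# Same sample data as above
sample_tokens <- c("word", "another", "a", "new", "word token",
                   "one", "more", "and", "another one")

# Keywords to retain (illustrative name)
keep <- c("a", "and")

# %in% tests each token for exact equality with any keyword
kept <- sample_tokens[sample_tokens %in% keep]
print(kept)
```

Unlike the word-boundary pattern, this would not keep a multi-word token such as "a word", because %in% compares whole strings only.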
Edit:
If the corpus is a list, this solution suggested by @John will work:
Data:
sample_tokens <- list(c("word", "another", "a", "new", "word token", "one", "more", "and", "another one"),
                      c("yet", "a", "few", "more", "words"),
                      c("and", "so on"))
# Apply the same filter to every element of the list
lapply(sample_tokens, function(x) x[which(grepl("\\b(a|and)\\b", x))])
[[1]]
[1] "a"   "and"

[[2]]
[1] "a"

[[3]]
[1] "and"
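For the keywords in the question, the pattern can be built from a character vector instead of being typed by hand. The sketch below assumes the corpus is a plain list of character vectors; tkn_pb_example is a made-up stand-in, since the actual structure of tkn_pb is not shown in the question.

```r
# Hypothetical stand-in for tkn_pb: a list of tokenized documents
tkn_pb_example <- list(
  c("the", "gunman", "fled", "after", "the", "attack"),
  c("no", "attack", "was", "reported"),
  c("police", "arrived")
)

keywords <- c("attack", "gunman")

# Build "\\b(attack|gunman)\\b" from the keyword vector
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")

# Keep only the tokens that match the pattern in each document
filtered <- lapply(tkn_pb_example, function(x) x[grepl(pattern, x)])
print(filtered)
```

A document that contains none of the keywords comes back as an empty character vector, so the list keeps its original length. Keywords containing regex metacharacters would need escaping before being pasted into the pattern.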