Loop through a tm corpus without losing corpus structure
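I have a tm corpus of documents and a list of words. I want to run a for loop over the corpus so that the loop removes each word in the list from the corpus, one after another.
Some reproducible data: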
library(tm)
m <- cbind(c("Apple blue two","Pear yellow five","Banana yellow two"),
c(1, 2, 3))
tm_corpus <- Corpus(VectorSource(m[,1]))
words <- as.list(c("Apple", "yellow", "two"))
tm_corpus is now a corpus object containing 3 documents:
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 3
words is a list of 3 words:
[[1]]
[1] "Apple"
[[2]]
[1] "yellow"
[[3]]
[1] "two"
I have tried three different loops. The first is:
tm_corpusClean <- tm_corpus
for (i in seq_along(tm_corpusClean)) {
  for (u in seq_along(words)) {
    tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords, words[[u]])
  }
}
which returns the following error, plus warnings numbered 1-7:
Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions
In addition: Warning messages:
1: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,
words[[u]]) :
number of items to replace is not a multiple of replacement length
2: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,
words[[u]]) :
number of items to replace is not a multiple of replacement length
[...]
The second is:
tm_corpusClean <- tm_corpus
for (i in seq_along(words)) {
  for (u in seq_along(tm_corpusClean)) {
    tm_corpusClean[u] <- tm_map(tm_corpusClean[u], removeWords, words[[i]])
  }
}
which returns the error:
Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions
The final loop is:
tm_corpusClean <- tm_corpus
for (i in seq_along(words)) {
  tm_corpusClean <- tm_map(tm_corpusClean, removeWords, words[[i]])
}
This actually returns an object called tm_corpusClean, but that object only gives me back the first document instead of the original three:
inspect(tm_corpusClean[[1]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 6
blue
Where am I going wrong?
Before we get to sequential removal, test whether tm_map works on your example:
obj1 <- tm_map(tm_corpus, removeWords, unlist(words))
sapply(obj1, `[`, "content")
$`1.content`
[1] " blue "
$`2.content`
[1] "Pear five"
$`3.content`
[1] "Banana "
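The same kind of check can be run on the last loop from the question. The sketch below assumes that loop is used unchanged and that the whole corpus is examined rather than only tm_corpusClean[[1]]; corpus_text is a hypothetical helper introduced here for illustration, not a tm function.

```r
library(tm)

# Same sample data as in the question
docs <- c("Apple blue two", "Pear yellow five", "Banana yellow two")
tm_corpus <- Corpus(VectorSource(docs))
words <- as.list(c("Apple", "yellow", "two"))

# Third loop from the question: each pass works on the result of the previous one
tm_corpusClean <- tm_corpus
for (i in seq_along(words)) {
  tm_corpusClean <- tm_map(tm_corpusClean, removeWords, words[[i]])
}

# Hypothetical helper: pull the text of every document into a character vector
corpus_text <- function(corpus) unname(unlist(lapply(corpus, as.character)))

length(tm_corpusClean)       # number of documents in the cleaned corpus
corpus_text(tm_corpusClean)  # text of all documents, not just the first
```

If length() reports 3 here, the loop has kept the corpus structure, and the single document shown in the question comes from indexing with [[1]] before calling inspect().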
Next, use lapply to remove one word at a time, i.e. "Apple", "yellow", "two":
obj2 <- lapply(words, function(word) tm_map(tm_corpus, removeWords, word))
sapply(obj2, function(x) sapply(x, `[`, "content"))
          [,1]                [,2]             [,3]
1.content " blue two"         "Apple blue two" "Apple blue "
2.content "Pear yellow five"  "Pear five"      "Pear yellow five"
3.content "Banana yellow two" "Banana two"     "Banana yellow "
Note that the resulting corpora sit in a nested list (which is why two sapply calls are needed to view the contents).
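The lapply call above starts from the original corpus for every word, so it yields three separate corpora. If the goal is a single corpus with all words removed cumulatively, one possible sketch uses base R's Reduce to pass the corpus from one removal to the next. This is an alternative offered here under the same sample data, not part of the approach above; corpus_text is again a hypothetical helper.

```r
library(tm)

# Same sample data as in the question
docs <- c("Apple blue two", "Pear yellow five", "Banana yellow two")
tm_corpus <- Corpus(VectorSource(docs))
words <- as.list(c("Apple", "yellow", "two"))

# Reduce() calls the function with the running corpus and the next word,
# starting from tm_corpus, so the removals accumulate in one corpus
tm_corpusClean <- Reduce(
  function(corpus, word) tm_map(corpus, removeWords, word),
  words,
  tm_corpus
)

# Hypothetical helper: pull the text of every document into a character vector
corpus_text <- function(corpus) unname(unlist(lapply(corpus, as.character)))

length(tm_corpusClean)       # number of documents in the cleaned corpus
corpus_text(tm_corpusClean)  # text of all documents after all removals
```

The result is one corpus rather than a list of corpora, so a single pass over it is enough to view the contents.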