R: inspect Document Term Matrix results in Error: Repeated indices currently not allowed
I have the following dummy data:
final6 <- data.frame(docname = paste0("doc", 1:6),
articles = c("Catalonia independence in matter of days",
"Anger over Johnson Libya bodies comment",
"Man admits frenzied mum and son murder",
"The headache that changed my life",
"Las Vegas killer sick, demented - Trump",
"Instagram baby photo scammer banned")
)
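I want to create a DocumentTermMatrix that references the document names, so that I can later link back to the original article text. To do this, I followed the instructions in this post: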
myReader <- readTabular(mapping=list(content="articles", id="docname"))
text_corpus <- VCorpus(DataframeSource(final6), readerControl = list(reader = myReader))
# define function that replaces punctuation with spaces
replacePunctuation <- content_transformer(function(x) {return (gsub("[[:punct:]]"," ", x))}) # replaces punctuation with empty spaces
# remove customised words
myWords <- c("ok", "chat", 'okay', 'day', 'today', "might", "bye", "hello", "thank", "you", "please", "sorry", "hello", "hi")
# clean text
cleantext <- function(corpus){
  clean_corpus <- tm_map(corpus, removeNumbers)
  clean_corpus <- tm_map(clean_corpus, tolower)
  clean_corpus <- tm_map(clean_corpus, PlainTextDocument)
  clean_corpus <- tm_map(clean_corpus, replacePunctuation)
  clean_corpus <- tm_map(clean_corpus, removePunctuation)
  clean_corpus <- tm_map(clean_corpus, removeWords, c(stopwords("english"), myWords, top_names))
  clean_corpus <- tm_map(clean_corpus, stripWhitespace)
  clean_corpus <- tm_map(clean_corpus, stemDocument, language = "english")
  clean_corpus
}
clean_corpus <- cleantext(text_corpus)
# create dtm
chat_DTM <- DocumentTermMatrix(clean_corpus, control = list(wordLengths = c(3, Inf)))
Now, when I try to inspect the matrix, I get an error:
inspect(chat_DTM)
Error in `[.simple_triplet_matrix`(x, docs, terms) :
  Repeated indices currently not allowed.
To be fair, this error also occurs when I build the corpus from the text alone, without passing the doc id as an attribute. Any idea what is causing this?
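The problem is the PlainTextDocument function, which erases the metadata from the corpus, including the document ids. Without distinct ids the documents can no longer be told apart by name, which appears to be what inspect() trips over. If you modify the cleantext function as follows, you get a clean DTM that can be inspected without any error: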
cleantext <- function(corpus){
  clean_corpus <- tm_map(corpus, removeNumbers)
  # modified: wrap tolower in content_transformer() so only the document content is changed
  clean_corpus <- tm_map(clean_corpus, content_transformer(tolower))
  # removed: PlainTextDocument erases the corpus metadata, including the document ids
  # clean_corpus <- tm_map(clean_corpus, PlainTextDocument)
  clean_corpus <- tm_map(clean_corpus, replacePunctuation)
  clean_corpus <- tm_map(clean_corpus, removePunctuation)
  # note: top_names is not defined in this post; it must exist as a character vector of extra words
  clean_corpus <- tm_map(clean_corpus, removeWords, c(stopwords("english"), myWords, top_names))
  clean_corpus <- tm_map(clean_corpus, stripWhitespace)
  clean_corpus <- tm_map(clean_corpus, stemDocument, language = "english")
  clean_corpus
}
clean_corpus <- cleantext(text_corpus)
chat_DTM2 <- DocumentTermMatrix(clean_corpus)
inspect(chat_DTM2)
This answer was inspired by this solution. Thanks!
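To make the difference between the two approaches concrete, here is a toy sketch in base R that does not use tm at all. `toy_doc`, `toy_content_transformer`, `toy_rebuild` and `doc_ids` are hypothetical names, and the list layout is only a stand-in for a tm document. The point is that a wrapper which rewrites the content in place leaves the id alone, while rebuilding a document from its text alone loses it, so every document ends up with the same (empty) id.

```r
# Toy stand-in for a text document: content plus metadata (hypothetical layout, not tm's).
toy_doc <- function(text, id) {
  list(content = text, meta = list(id = id))
}

# Toy analogue of content_transformer(): apply FUN to the content only
# and return the document with its metadata untouched.
toy_content_transformer <- function(FUN) {
  function(doc, ...) {
    doc$content <- FUN(doc$content, ...)
    doc
  }
}

# Toy analogue of rebuilding a document from its text: the id is not carried over.
toy_rebuild <- function(doc) {
  toy_doc(doc$content, id = character(0))
}

docs <- list(
  toy_doc("Catalonia independence in matter of days", "doc1"),
  toy_doc("Anger over Johnson Libya bodies comment", "doc2")
)

# Collect the ids as one string per document ("" when the id is missing).
doc_ids <- function(ds) {
  vapply(ds, function(d) paste(d$meta$id, collapse = ""), character(1))
}

kept <- lapply(docs, toy_content_transformer(tolower))
lost <- lapply(docs, toy_rebuild)

ids_kept <- doc_ids(kept)   # ids survive the transformation
ids_lost <- doc_ids(lost)   # every id collapses to the same empty string

print(ids_kept)
print(anyDuplicated(ids_kept) > 0)
print(anyDuplicated(ids_lost) > 0)
```

A check like `anyDuplicated()` on the document ids of a real corpus, run before building the DTM, would be one way to catch this kind of problem early.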
If you create a directory source with DirSource(recursive=T, ...) and two or more files in different paths have the same name, you may run into a similar error.
In that case, the fix is:
ds <- DirSource(".", recursive=TRUE)
ovid <- VCorpus(ds)
# use the full file paths as document names so that they are unique
names(ovid) <- ds$filelist
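Why this workaround helps can be seen without tm. Assuming, as this answer implies, that the default document names come from the file names, files in different directories can collide while their full paths stay distinct. The sketch below uses made-up paths (`paths` is hypothetical, standing in for `ds$filelist`) and only base R.

```r
# Hypothetical file list, standing in for ds$filelist from a recursive DirSource.
paths <- c("./book1/chapter1.txt",
           "./book1/chapter2.txt",
           "./book2/chapter1.txt")

# Base names collide when the same file name appears in two directories ...
base_names <- basename(paths)
dup_base <- anyDuplicated(base_names) > 0

# ... while the full paths are unique, so they are safe to use as document names.
dup_full <- anyDuplicated(paths) > 0

# Alternative if short names are preferred: make.unique() appends a suffix to repeats.
short_unique <- make.unique(base_names)

print(dup_base)
print(dup_full)
print(short_unique)
```

Using the full paths keeps the link back to the source file obvious; `make.unique()` is a possible alternative when shorter names matter more than traceability.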