Error when importing a tm VCorpus into a quanteda corpus

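This code snippet worked fine until yesterday, when I decided to update R (3.6.3) and RStudio (1.2.5042), although it is not obvious to me that the update is the source of the problem.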

In short, I converted 91 PDF files into a volatile corpus named Vcorp and confirmed that it is a volatile corpus as follows:

> Vcorp <- VCorpus(VectorSource(citiesText)) 
> class(Vcorp)
[1] "VCorpus" "Corpus" 

I then tried to import this tm VCorpus into quanteda, but I keep getting an error message that I did not get before (for example, the day before the update).

> data(Vcorp, package = "tm")   
> citiesCorpus <- corpus(Vcorp)
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 8714, 91 

Any suggestions? Thanks.

Without a) version information for your packages and b) a reproducible example, it is impossible to know the exact problem.
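As a rough sketch of how that version information could be gathered for a question like this, the snippet below prints the R version, the installed versions of quanteda and tm (or NA when a package is not installed), and the full session details. The variable names are illustrative only.

```r
# Base R version string
r_version <- R.version.string
print(r_version)

# Installed version of each package, or NA if it is not installed
pkgs <- c("quanteda", "tm")
pkg_versions <- vapply(pkgs, function(p) {
  if (requireNamespace(p, quietly = TRUE)) {
    as.character(packageVersion(p))
  } else {
    NA_character_
  }
}, character(1))
print(pkg_versions)

# Full session details: platform, locale, attached packages
print(sessionInfo())
```

Pasting this output into the question lets others check whether they are running the same package versions.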

Why use tm at all? You can create a quanteda corpus directly:

corpus(citiesText)
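A minimal sketch of the direct route, assuming the extracted text is a character vector with one string per document. The vector `cities_text` and its contents are hypothetical stand-ins, since the real `citiesText` is not shown in the question.

```r
# Hypothetical stand-in for citiesText: one string per document
cities_text <- c(
  cityA = "Text extracted from the first PDF.",
  cityB = "Text extracted from the second PDF."
)

# Only run the quanteda calls when the package is available
if (requireNamespace("quanteda", quietly = TRUE)) {
  cities_corpus <- quanteda::corpus(cities_text)
  print(quanteda::ndoc(cities_corpus))      # number of documents
  print(quanteda::docnames(cities_corpus))  # document names
}
```

This skips the tm round trip entirely, so there is no conversion step that could fail.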

Converting a VCorpus works fine for me.

library("quanteda")
## Package version: 2.0.1

library("tm")
packageVersion("tm")
## [1] ‘0.7.7’

reut21578 <- system.file("texts", "crude", package = "tm")
VCorp <- VCorpus(
  DirSource(reut21578, mode = "binary"),
  list(reader = readReut21578XMLasPlain)
)

corpus(VCorp)
## Corpus consisting of 20 documents and 16 docvars.
## text1 :
## "Diamond Shamrock Corp said that effective today it had cut i..."
## 
## text2 :
## "OPEC may be forced to meet before a scheduled June session t..."
## 
## text3 :
## "Texaco Canada said it lowered the contract price it will pay..."
## 
## text4 :
## "Marathon Petroleum Co said it reduced the contract price it ..."
## 
## text5 :
## "Houston Oil Trust said that independent petroleum engineers ..."
## 
## text6 :
## "Kuwait"s Oil Minister, in remarks published today, said ther..."
## 
## [ reached max_ndoc ... 14 more documents ]
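One possible explanation for the "8714, 91" mismatch in the question, offered as an assumption rather than a confirmed diagnosis, is that each of the 91 documents holds several text elements (pages or lines), so the total number of text elements differs from the number of documents. The base R sketch below uses hypothetical data to show how to check for that and how to collapse each document to a single string.

```r
# Hypothetical stand-in for citiesText: a list with several strings per document
pages_per_doc <- list(
  c("page one of A", "page two of A"),
  c("page one of B", "page two of B", "page three of B")
)

# Number of documents versus total number of text elements
n_docs <- length(pages_per_doc)
n_elements <- sum(lengths(pages_per_doc))
print(c(documents = n_docs, elements = n_elements))

# Collapse every document to exactly one string
flat_text <- vapply(pages_per_doc, paste, character(1), collapse = "\n")
print(length(flat_text))
```

If the two counts differ for the real data, passing the collapsed vector to `VCorpus(VectorSource(...))` or straight to `corpus()` gives one text per document.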