R: having trouble using quanteda corpus with readtext

Question

After reading in my corpus with the quanteda package, I get the same error with various subsequent statements:

Error in UseMethod("texts") : no applicable method for 'texts' applied to an object of class "c('corpus_frame', 'data.frame')").

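For example, this simple statement fails: texts(mycorpus)[2]. My actual goal is to create a dfm, which gives me the same error message as above.

I read in the corpus with this code: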

`mycorpus <- corpus_frame(readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt", 
docvarsfrom="filenames", dvsep="_", docvarnames=c("Date of Publication", "Length LexisNexis"), 
encoding = "UTF-8-BOM"))`

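My dataset consists of 50 newspaper articles, including some metadata such as the date of publication.

See the screenshot.

Why do I get this error every time? Thanks very much for your help!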

Response 1:

When using only readtext(), I get one step further and texts(text.corpus)[1] does not produce an error.

But when tokenizing, the same error occurs again, so:

token <- tokenize(text.corpus, removePunct=TRUE, removeNumbers=TRUE, ngrams 
= 1:2)
tokens(text.corpus)

yields:

Error in UseMethod("tokenize") : no applicable method for 'tokenize' applied to an object of class "c('readtext', 'data.frame')"

Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "c('readtext', 'data.frame')"

Response 2:

Now I get these two error messages in return. I got them initially as well, which is why I started using corpus_frame().

Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "c('corpus_frame', 'data.frame')"

In addition: Warning message: 'corpus' is deprecated. Use 'corpus_frame' instead. See help("Deprecated")

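Do I need to specify that tokenization, or any other step, is applied only to the 'text' column rather than to the entire dataset?

Response 3:

Thank you, Patrick, this does clarify things and gets me a step further. When running this: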

# Quanteda - corpus way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis", "source"), 
         encoding = "UTF-8-BOM")  %>%
  corpus() %>%
  tokens(removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)

I get:

Error in tokens_internal(texts(x), ...) : the ... list does not contain 3 elements In addition: Warning message: removePunctremoveNumbers is deprecated; use remove_punctremove_numbers instead

So I changed it accordingly (using remove_punct and remove_numbers), and now the code runs well.

Alternatively, I also tried this:

# Corpus - term_matrix way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis", "source"), 
         encoding = "UTF-8-BOM")  %>%
  term_matrix(drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2)

This gives the following error:

Error in term_matrix(., drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2) : unrecognized text filter property: 'drop_numbers'

After removing drop_numbers = TRUE, the matrix is actually produced. Thanks very much for your help!

Answer 1

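OK, you are getting this error because (as the error message states) there is no tokens() method for an object of class readtext, which is a special version of a data.frame. (Note: tokenize() is older, deprecated syntax that will be removed in the next version; use tokens() instead.)

You want this: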

library("quanteda")
library("readtext")
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis"), 
         encoding = "UTF-8-BOM")  %>%
    corpus() %>%
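    # later quanteda versions warn that these names are deprecated;
    # use remove_punct and remove_numbers there instead
    tokens(removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)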

It was the corpus() step you were missing. corpus_frame() is from a different package (my friend Patrick Perry's corpus).
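The dispatch failure described in this answer can be reproduced in base R without either package. The following is a minimal sketch, not quanteda's actual implementation: `my_tokens`, `my_corpus`, `my_readtext` and `as_my_corpus` are hypothetical names invented for illustration. It shows that an S3 generic fails with "no applicable method" when none of the object's classes has a method, and that converting the object to a class that does have one (the analogue of calling corpus()) resolves it.

```r
# Toy generic standing in for a function like tokens(); all names are hypothetical.
my_tokens <- function(x, ...) UseMethod("my_tokens")

# A method exists only for the "my_corpus" class.
my_tokens.my_corpus <- function(x, ...) strsplit(x$text, "\\s+")

# An object shaped like a readtext result: a data.frame carrying an extra class.
rt <- structure(
  data.frame(doc_id = "a.txt", text = "hello big world", stringsAsFactors = FALSE),
  class = c("my_readtext", "data.frame")
)

# Dispatch fails: there is no method for "my_readtext" or "data.frame", and no default.
err <- tryCatch(my_tokens(rt), error = function(e) conditionMessage(e))
print(err)

# Converting to the class that has a method is the analogue of the corpus() step.
as_my_corpus <- function(x) structure(list(text = x$text), class = "my_corpus")
toks <- my_tokens(as_my_corpus(rt))
print(toks)
```

The same reasoning explains why the error message names both classes: R tries each class in the object's class vector in turn before giving up.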

Answer 2

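To clarify the situation:

Version 0.9.1 of the corpus package had a function called corpus. quanteda also has a function called corpus. To avoid the name clash between the two packages, the corpus package's corpus function was deprecated and renamed to corpus_frame in version 0.9.2; it was removed in version 0.9.3.

To avoid the name clash with quanteda, either upgrade corpus to the latest version on CRAN (0.9.3), or else do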

library(corpus)
library(quanteda)

instead of the other order.
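Why the attach order matters can be illustrated with a small base R sketch. It uses two plain environments as stand-ins for packages (the names `toy:pkg_a` and `toy:pkg_b` are hypothetical), so it only approximates what library() does: whichever is attached last sits earlier on the search path, and an unqualified call finds that one first.

```r
# Two toy "packages" (plain environments) that both define a function named corpus().
pkg_a <- new.env()
pkg_a$corpus <- function() "corpus() from toy pkg_a"
pkg_b <- new.env()
pkg_b$corpus <- function() "corpus() from toy pkg_b"

# Attach in this order: pkg_b is attached last, so it masks pkg_a.
attach(pkg_a, name = "toy:pkg_a")
attach(pkg_b, name = "toy:pkg_b")

# An unqualified call resolves to the most recently attached definition.
unqualified <- corpus()

# Looking the function up in a named search-path entry removes the ambiguity.
explicit <- get("corpus", pos = "toy:pkg_a")()

print(unqualified)
print(explicit)

# Clean up the search path.
detach("toy:pkg_b", character.only = TRUE)
detach("toy:pkg_a", character.only = TRUE)
```

With real packages, the `::` operator plays the role of the explicit lookup: writing quanteda::corpus(...) selects quanteda's function regardless of the order in which the packages were loaded.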

Now, if you want to use quanteda to tokenize your texts, follow the advice given in Ken's answer:

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
     docvarsfrom = "filenames", dvsep = "_", 
     docvarnames = c("Date of Publication", "Length LexisNexis"), 
     encoding = "UTF-8-BOM")  %>%
    corpus() %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1:2)

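If your goal is to get a document-by-term count matrix, you may want to use the dfm function instead of the tokens function.

If you want to use the corpus package, instead do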
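As a rough sketch of that route, the following builds a dfm from two inline texts rather than from files, so the texts and document names are made up for illustration. It assumes a reasonably recent quanteda is installed, and it forms the n-grams with tokens_ngrams() as a separate step rather than through the ngrams argument used above, since argument names have changed between quanteda versions.

```r
library(quanteda)

# Two made-up documents standing in for the newspaper articles.
txt <- c(d1 = "The cat sat.", d2 = "The dog sat, twice: 2 times.")

# corpus() first, then tokenize, dropping punctuation and numbers.
toks <- tokens(corpus(txt), remove_punct = TRUE, remove_numbers = TRUE)

# Unigrams and bigrams, formed as a separate step.
toks <- tokens_ngrams(toks, n = 1:2)

# Document-feature matrix: one row per document, one column per feature.
m <- dfm(toks)
print(m)
```

With files on disk, the readtext() call from the question would replace the inline `txt` vector before the corpus() step.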

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
     docvarsfrom = "filenames", dvsep = "_", 
     docvarnames = c("Date of Publication", "Length LexisNexis"), 
     encoding = "UTF-8-BOM")  %>%
    term_matrix(drop_punct = TRUE, drop_number = TRUE, ngrams = 1:2)

Depending on what you are trying to do, you may want to use the term_stats function instead of the term_matrix function.

r

corpus

quanteda