Compiling and analysing a Corpus with R and koRpus

Question

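I am a literature student who is rather lost in the world of data science. I am trying to analyse a corpus of 70 .txt files, all of which are in one directory.

My final goal is to get a table containing the file name (or something similar), the sentence and word counts, a Flesch-Kincaid readability score and an MTLD lexical diversity score.

I found the packages koRpus and tm (and tm.plugin.koRpus) and have tried to understand their documentation, but I haven't come far. With the help of the RKWard IDE and the koRpus plugin I can get all of these measures for one file at a time and copy the data into a table by hand, but that is very cumbersome and still a lot of work.

What I have tried so far is this command to create a corpus of my files: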

simpleCorpus(dir = "/home/user/files/", lang = "en", tagger = "tokenize",
encoding = "UTF-8", pattern = NULL, recursive = FALSE, ignore.case = FALSE, mode = "text", source = "Wikipedia", format = "file",
mc.cores = getOption("mc.cores", 1L))

But I always get this error:

Error in data.table(token = tokens, tag = unk.kRp):column or argument 1 is NULL).

If someone could help an absolute newbie to R, I would be incredibly grateful!

Answer 1

The link below is a very comprehensive walkthrough. If I were you, I would work through it step by step.

http://tidytextmining.com/tidytext.html

Answer 2

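I found the solution with the help of unDocUMeantIt, the author of the package (thank you!). An empty file in the directory caused the error; after removing it, I managed to get everything running.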
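In case it helps anyone hitting the same error, a quick way to spot zero-byte files before building the corpus might look like the following. This is only a minimal sketch in base R: the helper name `find_empty_files` and the `.txt` pattern are my own assumptions and are not part of koRpus.

```r
# Hypothetical helper (not part of koRpus): list zero-byte files in a directory.
find_empty_files <- function(dir, pattern = "\\.txt$") {
  # Full paths of all files in `dir` whose names match the pattern
  paths <- list.files(dir, pattern = pattern, full.names = TRUE)
  # file.size() returns the size in bytes (NA if a file cannot be read)
  sizes <- file.size(paths)
  # Keep only the files that are exactly 0 bytes long
  paths[!is.na(sizes) & sizes == 0]
}

# Usage: prints the paths of any empty files (character(0) if there are none)
# find_empty_files("/home/user/files/")
```

Any files this reports can then be removed or moved out of the directory before the corpus is created. Note that it only catches files of exactly zero bytes, not files that contain nothing but whitespace.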

Answer 3

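I suggest you have a look at our quanteda vignette, Digital Humanities Use Case: Replication of analyses from Text Analysis with R for Students of Literature, which replicates Matt Jockers's book of the same name.

For what you are looking for above, the following would work: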

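library(readtext)
library(quanteda)
# in quanteda 3 and later, the textstat_*() functions live in quanteda.textstats
library(quanteda.textstats)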

# reads in all of your texts and puts them into a corpus
mycorpus <- corpus(readtext("/home/user/files/*"))

# sentence and word counts
(output_df <- summary(mycorpus))

# to compute Flesch-Kincaid readability on the texts
textstat_readability(mycorpus, "Flesch.Kincaid")

# to compute lexical diversity on the texts
# (tokenize first: recent quanteda versions no longer build a dfm directly from a corpus)
textstat_lexdiv(dfm(tokens(mycorpus)))

The textstat_lexdiv() function does not currently have MTLD, but we are working on it, and it has half a dozen other measures.
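Until a packaged MTLD is available there, the measure can be roughly approximated by hand. The following is a minimal base R sketch of the usual description of MTLD (count how many "factors" a text contains, a factor being a stretch of tokens over which the type-token ratio drops below 0.72, then average a forward and a backward pass). The function names `mtld_pass` and `mtld`, the strict `<` comparison and the `NA` return for texts with no factor are my own assumptions; established implementations such as MTLD() in koRpus offer more options and may give somewhat different numbers.

```r
# One directional MTLD pass over a character vector of tokens (illustrative sketch).
mtld_pass <- function(tokens, threshold = 0.72) {
  factors <- 0            # number of completed (plus partial) factors
  seen <- character(0)    # types seen in the current segment
  n_seg <- 0              # tokens in the current segment
  ttr <- 1                # type-token ratio of the current segment
  for (tok in tokens) {
    n_seg <- n_seg + 1
    if (!(tok %in% seen)) seen <- c(seen, tok)
    ttr <- length(seen) / n_seg
    if (ttr < threshold) {
      # The segment's TTR fell below the threshold: count a full factor and reset
      factors <- factors + 1
      seen <- character(0)
      n_seg <- 0
      ttr <- 1
    }
  }
  # Leftover tokens contribute a partial factor proportional to how far TTR has dropped
  if (n_seg > 0) factors <- factors + (1 - ttr) / (1 - threshold)
  # Assumption: return NA when no factor (not even a partial one) was found
  if (factors == 0) return(NA_real_)
  length(tokens) / factors
}

# MTLD as the mean of a forward and a backward pass
mtld <- function(tokens, threshold = 0.72) {
  mean(c(mtld_pass(tokens, threshold), mtld_pass(rev(tokens), threshold)))
}

# Usage with a naive lower-case whitespace tokenization
toks <- tolower(strsplit("the cat sat on the mat and the cat sat again", "\\s+")[[1]])
mtld(toks)
```

The tokens for each document could come from any tokenizer; the whitespace split above is only there to keep the example self-contained.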
