Create dfm step by step with quanteda

Question

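I want to analyze a large (n=500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated route through dfm(). I have reasons for this: in one case, I don't want to tokenize before removing stopwords, as this would produce many useless bigrams; in another, I have to preprocess the text with language-specific procedures.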

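I would like to implement this sequence:
1) remove punctuation and numbers
2) remove stopwords (i.e. before forming n-grams, to avoid useless tokens)
3) tokenize into unigrams and bigrams
4) create the dfm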

My attempt:

> library(quanteda)
> packageVersion("quanteda")
[1] ‘0.9.8’
> text <- ie2010Corpus$documents$texts
> text.corpus <- quanteda:::corpus(text, docnames=rownames(ie2010Corpus$documents))

> class(text.corpus)
[1] "corpus" "list"

> stopw <- c("a","the", "all", "some")
> TextNoStop <- removeFeatures(text.corpus, features = stopw)
# Error in UseMethod("selectFeatures") : 
# no applicable method for 'selectFeatures' applied to an object of class "c('corpus', 'list')"

# This is how I would theoretically continue: 
> token <- tokenize(TextNoStop, removePunct=TRUE, removeNumbers=TRUE)
> token2 <- ngrams(token,c(1,2))

Bonus question: How do I remove sparse tokens in quanteda (i.e. the equivalent of removeSparseTerms() in tm)?

UPDATE: In light of @Ken's answer, here is the code to proceed step by step with quanteda:

library(quanteda)
packageVersion("quanteda")
[1] ‘0.9.8’

1) Remove custom punctuation and numbers. For example, note the "\n" in the ie2010 corpus:

text.corpus <- ie2010Corpus
texts(text.corpus)[1]      # Use texts() to extract the text
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery.\nIt is

texts(text.corpus)[1] <- gsub("\\s", " ", texts(text.corpus)[1])    # replace each whitespace character (incl. \n, \t, \r...) with a plain space
texts(text.corpus)[1]
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery. It is of e
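
The same gsub() approach extends to the punctuation and digits named in step 1. What follows is a minimal base-R sketch, assuming Unicode property classes through perl = TRUE; the helper name strip_punct_digits() and the sample sentence are made up for illustration. The cleaned strings could then be written back with texts(text.corpus) <- as in the line above.

```r
# Hypothetical helper: replace runs of punctuation (\p{P}) and digits (\p{N})
# with a space, then collapse repeated whitespace and trim the ends.
strip_punct_digits <- function(x) {
  x <- gsub("[\\p{P}\\p{N}]+", " ", x, perl = TRUE)
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

# Invented sample sentence containing digits, punctuation and a newline.
sample_text <- "In 2010, spending fell by 4.5%;\nrecovery began."
cleaned_text <- strip_punct_digits(sample_text)
cleaned_text
```

Replacing with a space rather than deleting outright keeps words that were separated only by punctuation from being fused together.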

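A further note on why one might prefer to preprocess. My current corpus is in Italian, a language in which articles are joined to the following word by an apostrophe. A straight dfm() call can therefore tokenize inexactly. For example: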

broken.tokens <- dfm(corpus(c("L'abile presidente Renzi. Un'abile mossa di Berlusconi")), removePunct=TRUE)

will produce two separate tokens for the same word ("un'abile" and "l'abile"), hence the need for an additional step with gsub() here.
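
A minimal base-R sketch of that extra gsub() step, assuming plain ASCII apostrophes; the helper name split_elisions() is made up for illustration, and the strsplit() call is only a crude stand-in for a real tokenizer:

```r
txt <- "L'abile presidente Renzi. Un'abile mossa di Berlusconi"

# Hypothetical helper: put a space after an apostrophe that sits between two
# letters, so the elided article is separated from the word that follows it.
split_elisions <- function(x) {
  gsub("(?<=[[:alpha:]])'(?=[[:alpha:]])", "' ", x, perl = TRUE)
}

separated <- split_elisions(txt)

# Crude whitespace tokenization, used here only to inspect the result.
words <- strsplit(tolower(gsub("[[:punct:]]", "", separated)), "\\s+")[[1]]
table(words)[["abile"]]
```

After the substitution, "abile" is counted as one word occurring twice, and "l" and "un" become separate tokens that can be removed as stopwords.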

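2) In quanteda, it is not possible to remove stopwords directly from the text before tokenization. In my example above, "l" and "un" have to be removed so that they do not produce misleading bigrams. In tm this can be handled with tm_map(..., removeWords).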

3) Tokenization

token <- tokenize(text.corpus[1], removePunct=TRUE, removeNumbers=TRUE, ngrams = 1:2)

4) Create the dfm:

dfm <- dfm(token)

5) Remove sparse features

dfm <- trim(dfm, minCount = 5)

Answer 1

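We designed dfm() not as a "black box" but as more of a Swiss Army knife. However, all of its options are also available through lower-level processing commands, should you want a finer level of control.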

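However, one of the design principles of quanteda is that text only becomes "features" through the process of tokenization. If you have a set of tokenized features that you wish to exclude, you must first tokenize your text, or you cannot exclude them. Unlike other text packages for R (e.g. tm), these steps are applied "downstream" from a corpus, so the corpus remains an unprocessed set of texts to which manipulations are applied (but it will not itself become a transformed set of texts). The purpose of this is to preserve generality, and also to promote reproducibility and transparency in text analysis.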

In response to your questions:

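1) You can, however, override our encouraged behaviour using the texts(myCorpus) <- replacement function, where whatever is assigned will overwrite the existing texts. So you can use regular expressions to remove punctuation and numbers, for example with the stringi commands, using the Unicode classes for punctuation and numerals to identify the patterns.
2) I would recommend that you tokenize before removing stopwords. Stop "words" are tokens, so there is no way to remove them from the text before you tokenize it. Even applying a regular expression to substitute them with "" involves specifying some form of word boundary in the regex, and that, again, is tokenization.
3) To tokenize into unigrams and bigrams:

   tokens(myCorpus, ngrams = 1:2)
4) To create the dfm, simply call dfm(myTokens). (You could also apply step 3, for the ngrams, at this stage.)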
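
The four steps above can be sketched end to end. This is only a sketch, assuming a recent quanteda release (3.x or later), where the function names differ from the 0.9.8 calls in the question: tokens_remove() takes the place of removeFeatures(), tokens_ngrams() replaces the ngrams argument, and dfm_trim() replaces trim(). The two toy documents are invented for illustration; the stopword list is the one from the question.

```r
library(quanteda)

# Toy documents, invented for illustration.
txt <- c(d1 = "The quick fox jumped over all the lazy dogs in 2010.",
         d2 = "Some lazy dogs slept; the quick fox ran.")
stopw <- c("a", "the", "all", "some")

# 1) Tokenize, dropping punctuation and numbers.
toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)

# 2) Remove stopwords from the tokens before any n-grams are formed,
#    so that no bigram contains a stopword.
toks <- tokens_remove(toks, pattern = stopw)

# 3) Form unigrams and bigrams from the remaining tokens.
toks <- tokens_ngrams(toks, n = 1:2)

# 4) Build the document-feature matrix.
dfmat <- dfm(toks)

# Bonus: keep only features that occur at least twice across the corpus.
dfmat_trimmed <- dfm_trim(dfmat, min_termfreq = 2)
featnames(dfmat_trimmed)
```

Because the stopwords are dropped without leaving placeholders, the bigrams bridge the gaps they leave (for example "over_lazy"). If that is not wanted, the padding argument of tokens_remove() is worth a look.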

Bonus 1: n=2 collocations produce the same list as bigrams, only in a different format. Did you intend something else? (A separate SO question, perhaps?)

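Bonus 2: See dfm_trim(x, sparsity = ). The removeSparseTerms() options are quite confusing to most people, but this argument is included for migrants from tm. For a full explanation, see the post linked from the original answer.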
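
Since the sparsity threshold is the confusing part, here is a small base-R illustration of what it measures. It assumes the definition given in tm's documentation for removeSparseTerms(): a term's sparsity is the share of documents in which it does not occur, and only terms with sparsity below the threshold are kept. The count matrix is invented, and this is not how either package implements it.

```r
# Toy document-term counts (rows = documents, columns = terms), invented.
m <- matrix(c(1, 0, 0, 0,    # "rare": present in 1 of 4 documents
              2, 1, 0, 1,    # "common": present in 3 of 4 documents
              1, 1, 1, 1),   # "everywhere": present in all 4 documents
            nrow = 4,
            dimnames = list(paste0("doc", 1:4),
                            c("rare", "common", "everywhere")))

# Sparsity of a term = proportion of documents with a zero count for it.
term_sparsity <- colMeans(m == 0)

# Keep only the terms whose sparsity is below the chosen threshold.
threshold <- 0.5
kept <- m[, term_sparsity < threshold, drop = FALSE]
colnames(kept)
```

A higher threshold is therefore more permissive: at 0.99, a term survives if it appears in just over 1% of the documents.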

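BTW: use texts() instead of ie2010Corpus$documents$texts -- we will soon rewrite the object structure of the corpus, so you should not access its internals this way when there is an extractor function. (Also, this step is unnecessary: here you have simply recreated the corpus.)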

Update 2018-01:

The new name for the corpus object is data_corpus_irishbudget2010, and the collocation scoring function is textstat_collocations().

r

text-analysis

term-document-matrix

quanteda