Clean corpus using Quanteda

What is the Quanteda equivalent of the tm workflow below for cleaning a corpus (lowercase, remove punctuation, remove numbers, stem)? To be clear, I do *not* want to create a document-feature matrix with dfm(); I just want a clean corpus that I can use for specific downstream tasks.

# This is what I want to do in quanteda
library("tm")
data("crude")
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)

PS: I know I can do quanteda_corpus <- quanteda::corpus(crude) to get what I want, but I would prefer to be able to do everything in Quanteda.

I think what you are trying to do is intentionally made impossible in quanteda.

Of course, you can do the cleaning quite easily with the tokens* set of functions, without losing the order of the words:

library("tm")
data("crude")
library("quanteda")
toks <- corpus(crude) %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>% 
  tokens_wordstem()

print(toks, max_ndoc = 3)
#> Tokens consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#>  [1] "Diamond"  "Shamrock" "Corp"     "said"     "that"     "effect"  
#>  [7] "today"    "it"       "had"      "cut"      "it"       "contract"
#> [ ... and 78 more ]
#> 
#> reut-00002.xml :
#>  [1] "OPEC"    "may"     "be"      "forc"    "to"      "meet"    "befor"  
#>  [8] "a"       "schedul" "June"    "session" "to"     
#> [ ... and 427 more ]
#> 
#> reut-00004.xml :
#>  [1] "Texaco"   "Canada"   "said"     "it"       "lower"    "the"     
#>  [7] "contract" "price"    "it"       "will"     "pay"      "for"     
#> [ ... and 40 more ]
#> 
#> [ reached max_ndoc ... 17 more documents ]

But it is not possible to turn this tokens object back into a corpus. However, it is possible to write a new function that does this:

# Define a corpus() method for tokens objects, using quanteda internals
corpus.tokens <- function(x, ...) {
  quanteda:::build_corpus(
    # paste the tokens of each document back into a single string
    unlist(lapply(x, paste, collapse = " ")),
    # rebuild the document-level variables, keeping the original docvars
    docvars = cbind(quanteda:::make_docvars(length(x), docnames(x)), docvars(x))
  )
}

corp <- corpus(toks)
print(corp, max_ndoc = 3)
#> Corpus consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> "Diamond Shamrock Corp said that effect today it had cut it c..."
#> 
#> reut-00002.xml :
#> "OPEC may be forc to meet befor a schedul June session to rea..."
#> 
#> reut-00004.xml :
#> "Texaco Canada said it lower the contract price it will pay f..."
#> 
#> [ reached max_ndoc ... 17 more documents ]

But this object, while technically an object of class corpus, is not what a corpus is meant to be. From ?corpus [emphasis added]:

Value

A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus.

The object above does not match that description, since the original texts have been processed. Yet the class of the object communicates otherwise. I see no reason to break this logic, since all subsequent analysis steps should be possible using the tokens* and dfm* functions.
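For example, the cleaned tokens object can feed straight into a document-feature matrix for downstream analysis, with no intermediate corpus needed. A minimal sketch, assuming the same crude data as above (the object names toks and dfmat are illustrative):

```r
library("tm")
library("quanteda")
data("crude")

# Rebuild the cleaned tokens object from the answer above
toks <- corpus(crude) %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_wordstem()

# Downstream analysis works directly on the tokens object
dfmat <- dfm(toks)
topfeatures(dfmat, 5)  # most frequent stemmed features
```

The same tokens object also works with functions such as kwic() or fcm(), so the round-trip back to a corpus is rarely necessary.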