Clean corpus using Quanteda
What is the Quanteda way of cleaning a corpus the way tm does it (lowercase, remove punctuation, remove numbers, stem)? To be clear: I do not want to use dfm() to create a document-feature matrix; I just want a clean corpus that I can use for a specific downstream task.
# This is what I want to do in quanteda
library("tm")
data("crude")
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
PS: I know I could do quanteda_corpus <- quanteda::corpus(crude) to get what I want, but I would prefer to be able to do everything in Quanteda.
I think what you are trying to do is deliberately made impossible in quanteda.
You can, of course, do the cleaning quite easily with the tokens* set of functions without losing the order of the words:
library("tm")
data("crude")
library("quanteda")
toks <- corpus(crude) %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_wordstem()
print(toks, max_ndoc = 3)
#> Tokens consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> [1] "Diamond" "Shamrock" "Corp" "said" "that" "effect"
#> [7] "today" "it" "had" "cut" "it" "contract"
#> [ ... and 78 more ]
#>
#> reut-00002.xml :
#> [1] "OPEC" "may" "be" "forc" "to" "meet" "befor"
#> [8] "a" "schedul" "June" "session" "to"
#> [ ... and 427 more ]
#>
#> reut-00004.xml :
#> [1] "Texaco" "Canada" "said" "it" "lower" "the"
#> [7] "contract" "price" "it" "will" "pay" "for"
#> [ ... and 40 more ]
#>
#> [ reached max_ndoc ... 17 more documents ]
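Note that the tm pipeline in the question also lowercases, which the tokens() call above does not do (the output still shows "Diamond", "Shamrock", etc.). quanteda provides tokens_tolower() for that step; a minimal addition to the pipeline:

```r
# Lowercase the tokens, matching tm_map(crude, content_transformer(tolower))
toks <- tokens_tolower(toks)
```
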
But it is not possible to turn this tokens object back into a corpus. It would be possible to write a new function to do this:
# Coerce a tokens object back into a corpus by pasting the tokens of each
# document together, re-using quanteda's (non-exported) constructor helpers
corpus.tokens <- function(x, ...) {
  quanteda:::build_corpus(
    unlist(lapply(x, paste, collapse = " ")),
    docvars = cbind(quanteda:::make_docvars(length(x), docnames(x)), docvars(x))
  )
}
corp <- corpus(toks)
print(corp, max_ndoc = 3)
#> Corpus consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> "Diamond Shamrock Corp said that effect today it had cut it c..."
#>
#> reut-00002.xml :
#> "OPEC may be forc to meet befor a schedul June session to rea..."
#>
#> reut-00004.xml :
#> "Texaco Canada said it lower the contract price it will pay f..."
#>
#> [ reached max_ndoc ... 17 more documents ]
But this object, while technically a corpus class object, is not what a corpus is supposed to be. From ?corpus [emphasis added]:
Value
A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level
metadata, and default settings for subsequent processing of the
corpus.
The object above does not fit this description, as the original texts have already been processed. Yet the class of the object communicates otherwise. I don't see a reason to break this logic, since all subsequent analysis steps should be possible using the tokens* or dfm* functions.
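For instance, the stemmed tokens object can feed directly into a document-feature matrix without ever going back through a corpus; a minimal sketch of that downstream step:

```r
# Build a dfm straight from the cleaned tokens and inspect the most
# frequent features across all 20 documents
dfmat <- dfm(toks)
topfeatures(dfmat, 5)
```
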