Is there a simple way to reshape a token object to documents in quanteda?

Question

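I am trying to clean some text data. After tokenizing and, for example, removing punctuation, I would like to convert the tokens object back into a vector, data frame, or corpus.

My current approach is: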

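library(quanteda)
library(dplyr)  # only used here for the %>% pipe

raw <- c("This is text #1.", "And a second document...")
# Tokenize and drop punctuation
tokens <- raw %>% tokens(remove_punct = TRUE)
# Join each document's tokens with toString(), then strip the commas it inserts
docs <- lapply(tokens, toString) %>% gsub(pattern = ",", replacement = "")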

Is there a more "quanteda" way to do this, or at least a simpler one?

Answer 1

This is how I would do it. It preserves the document names as element names in the output vector. (If you do not want to keep them, you can add USE.NAMES = FALSE.)

> sapply(tokens, function(x) paste(as.character(x), collapse = " "))
                  text1                   text2 
      "This is text #1" "And a second document"

There is no need for library(dplyr) here, since this version does not use the pipe.
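Since the question also asks about getting a data frame or a corpus, here is a minimal sketch of how the collapsed strings might be carried further. It assumes a reasonably recent quanteda in which as.list() works on a tokens object and corpus() accepts a named character vector. The names toks, docs, corp and df are illustrative and not from the original posts.

```r
library(quanteda)

raw <- c("This is text #1.", "And a second document...")

# Tokenize and drop punctuation (named toks here to avoid masking tokens())
toks <- tokens(raw, remove_punct = TRUE)

# Collapse each document's tokens into one space-separated string.
# vapply() guarantees a character vector and keeps the document names.
docs <- vapply(as.list(toks), paste, character(1), collapse = " ")

# Assumption: corpus() uses the vector names as document names
corp <- corpus(docs)

# A plain data frame with one row per document
df <- data.frame(doc_id = names(docs), text = unname(docs),
                 stringsAsFactors = FALSE)
```

Using vapply() rather than sapply() is only a type-safety preference; the result should match the sapply() call above for this input.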

nlp

r

quanteda