Is it possible to maintain the order of ngrams in the output of the textcnt function in R?

Question

I am using the textcnt() function from the tau package to get bigrams, as shown below:

library(tau)
sentence <- "A sample sentence in English for testing purpose"
english <- textcnt(sentence, method = "string", n = 2, tolower = FALSE)

The bigrams are returned in alphabetical order, as shown below:

 A sample     English for     for testing      in English sample sentence     sentence in testing purpose

However, I am looking for a solution that returns the bigrams in the order in which they occur in the sentence. More precisely, the desired output is:

 A sample  sample sentence sentence in  in English  English for  for testing   testing purpose

If this is not possible with textcnt(), is there an alternative that produces the desired output?

Answer 1

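Try tokenize_ngrams() from the tokenizers package, which returns the n-grams in the order they occur in the text. It lowercases the input by default, as the output below shows; pass lowercase = FALSE to preserve the original case.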

library(tokenizers)
tokenize_ngrams(sentence, n = 2L)
# [[1]]
# [1] "a sample"        "sample sentence" "sentence in"     "in english"      "english for"     "for testing"     "testing purpose"
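
If adding a package is not an option, the same ordered result can be sketched in base R. This is only a minimal illustration, assuming whitespace-separated words and no punctuation handling; the helper name ngrams_in_order is hypothetical and not part of tau or tokenizers.

```r
# Hypothetical helper (not from tau or tokenizers): build word n-grams
# in the order they occur, assuming words are separated by whitespace.
ngrams_in_order <- function(text, n = 2L) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1L)
  # Join each window of n consecutive words with a single space.
  vapply(starts,
         function(i) paste(words[i:(i + n - 1L)], collapse = " "),
         character(1))
}

sentence <- "A sample sentence in English for testing purpose"
bigrams <- ngrams_in_order(sentence, n = 2L)
bigrams
# [1] "A sample"        "sample sentence" "sentence in"     "in English"
# [5] "English for"     "for testing"     "testing purpose"

# Counts like those from textcnt(), but kept in order of first appearance:
# a factor with levels in occurrence order stops table() from sorting.
counts <- table(factor(bigrams, levels = unique(bigrams)))
```

Case is preserved here because nothing lowercases the input, which matches the tolower = FALSE setting used in the question.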


r

n-gram