Is it possible to maintain the order of n-grams in the output of the textcnt function in R?
I am using the textcnt() function from the tau package to get bigrams, as follows:
library(tau)
sentence <- "A sample sentence in English for testing purpose"
english <- textcnt(sentence, method = "string", n = 2, tolower = FALSE)
The bigrams are returned in alphabetical order, as follows:
A sample English for for testing in English sample sentence sentence in testing purpose
However, I am looking for a solution that returns the bigrams in the order in which they appear in the sentence. More precisely, the desired output is:
A sample sample sentence sentence in in English English for for testing testing purpose
If this is not possible with textcnt(), is there an alternative that produces the desired output?
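Try tokenize_ngrams() from the tokenizers package, which returns the n-grams in the order they appear. It lowercases its input by default; pass lowercase = FALSE to preserve the original case: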
library(tokenizers)
tokenize_ngrams(sentence, n = 2L)
# [[1]]
# [1] "a sample" "sample sentence" "sentence in" "in english" "english for" "for testing" "testing purpose"
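If adding a package is not an option, the same ordered bigrams can be built in base R. The following is only a minimal sketch: ordered_ngrams() is a hypothetical helper written for illustration (it is not part of tau or tokenizers), and it assumes that splitting on whitespace is an acceptable way to tokenize the sentence.

```r
# Hypothetical helper (not from tau or tokenizers): returns the n-grams of
# `text` in the order they occur, preserving the original case.
ordered_ngrams <- function(text, n = 2L) {
  # Assumption: words are separated by whitespace only.
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]          # drop empty tokens from leading spaces
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1L)
  # Paste each window of n consecutive words into one string.
  vapply(starts,
         function(i) paste(words[i:(i + n - 1L)], collapse = " "),
         character(1))
}

sentence <- "A sample sentence in English for testing purpose"
bigrams <- ordered_ngrams(sentence, n = 2L)
print(bigrams)

# Counts in order of first appearance rather than alphabetical order.
counts <- table(factor(bigrams, levels = unique(bigrams)))
print(counts)
```

Setting the factor levels to unique(bigrams) is what keeps the counts in order of first appearance; without it, table() would sort the names alphabetically.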