"Bag of characters" n-grams in R

I want to create a term-document matrix containing character n-grams. For example, take the following sentence:

"In this paper, we focus on a different but simple text representation."

The character 4-grams would be: |In_t|, |n_th|, |_thi|, |this|, |his_|, |is_p|, |s_pa|, |_pap|, |pape|, |aper|, etc.

I have used the R/Weka package to handle "bag of words" n-grams, but I'm having difficulty adapting tokenizers such as the one below to handle characters:

BigramTokenizer <- function(x){
    NGramTokenizer(x, Weka_control(min = 2, max = 2))}

tdm_bigram <- TermDocumentMatrix(corpus,
                                 control = list(
                                 tokenize = BigramTokenizer, wordLengths=c(2,Inf)))

Any ideas on how to create character n-grams, using R/Weka or another package?

You need to use CharacterNGramTokenizer instead; NGramTokenizer splits on characters such as spaces.

##########
### The following lines are mainly a one-to-one copy from RWeka.
### Only the hardcoded CharacterNGramTokenizer is new.
library(rJava)


CharacterNGramTokenizer <- structure(function (x, control = NULL) 
{
  tokenizer <- .jnew("weka/core/tokenizers/CharacterNGramTokenizer")
  x <- Filter(nzchar, as.character(x))
  if (!length(x)) 
    return(character())
  .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer, 
                                                     "weka/core/tokenizers/Tokenizer"), .jarray(as.character(control)), 
         .jarray(as.character(x)))
}, class = c("R_Weka_tokenizer_interface", "R_Weka_interface"
), meta = structure(list(name = "weka/core/tokenizers/CharacterNGramTokenizer", 
                         kind = "R_Weka_tokenizer_interface", class = "character", 
                         init = NULL), .Names = c("name", "kind", "class", "init")))
### copy till here
###################

BigramTokenizer <- function(x){
    CharacterNGramTokenizer(x, Weka_control(min = 2, max = 2))}
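To get from the tokenizer back to the term-document matrix the question asks for, the same pattern as the word-based tokenizer should work. A sketch (assuming `RWeka` for `Weka_control`, `tm` for `TermDocumentMatrix`, and the `CharacterNGramTokenizer` defined above; `QuadgramTokenizer` is my own name, matching the 4-gram example in the question):

```r
# A 4-gram tokenizer, following the same pattern as BigramTokenizer above.
# Relies on CharacterNGramTokenizer and RWeka's Weka_control being in scope.
QuadgramTokenizer <- function(x) {
  CharacterNGramTokenizer(x, Weka_control(min = 4, max = 4))
}

# Then, as in the question (requires tm and an existing `corpus`):
# tdm_quadgram <- TermDocumentMatrix(corpus,
#                                    control = list(tokenize = QuadgramTokenizer,
#                                                   wordLengths = c(2, Inf)))
```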

Unfortunately, CharacterNGramTokenizer is not included in RWeka by default. But if you want to stay with Weka, this seems like a workable overall approach.

I found quanteda quite useful:

library(tm)
library(quanteda)
txts <- c("In this paper.", "In this lines this.")
toks <- tokenize(gsub("\\s", "_", txts), "character", ngrams = 4L, conc = "")
mydfm <- dfm(toks)
tdm <- as.TermDocumentMatrix(t(mydfm), weighting = weightTf)
as.matrix(tdm)
#       Docs
# Terms  text1 text2
#   In_t     1     1
#   n_th     1     1
#   _thi     1     2
#   this     1     2
#   his_     1     1
#   is_p     1     0
#   s_pa     1     0
#   _pap     1     0
#   pape     1     0
#   aper     1     0
#   per.     1     0
#   is_l     0     1
#   s_li     0     1
#   _lin     0     1
#   line     0     1
#   ines     0     1
#   nes_     0     1
#   es_t     0     1
#   s_th     0     1
#   his.     0     1
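If you would rather avoid extra packages entirely, the same character n-grams can be produced with base R's vectorized `substring` (a minimal sketch; `char_ngrams` is my own helper, and the underscore substitution is chosen to match the examples above):

```r
# Build character n-grams from a string using base R only.
# Spaces are replaced with underscores, as in the examples above.
char_ngrams <- function(txt, n = 4L) {
  txt <- gsub("\\s", "_", txt)
  len <- nchar(txt)
  if (len < n) return(character(0))
  # One n-gram starting at each position 1 .. len - n + 1.
  substring(txt, 1:(len - n + 1), n:len)
}

char_ngrams("In this", 4L)
# "In_t" "n_th" "_thi" "this"
```

Counting these per document (e.g. with `table()`) would then fill a term-document matrix by hand.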