"Bag of characters" R 中的 n-gram
"Bag of characters" n-grams in R
I would like to create a term-document matrix containing character n-grams. For example, take the following sentence:
"In this paper, we focus on a different but simple text representation."
The character 4-grams would be: |In_t|, |n_th|, |_thi|, |this|, |his_|, |is_p|, |s_pa|, |_pap|, |pape|, |aper|, and so on.
I have used the RWeka package for "bag of words" n-grams, but I am struggling to adapt a tokenizer such as the one below to work on characters:
library(tm)
library(RWeka)

# Word-level bigrams: NGramTokenizer operates on word tokens, not characters
BigramTokenizer <- function(x) {
    NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
tdm_bigram <- TermDocumentMatrix(corpus,
    control = list(tokenize = BigramTokenizer, wordLengths = c(2, Inf)))
Any ideas on how to create character n-grams with RWeka or another package?
You need to use the CharacterNGramTokenizer instead. The NGramTokenizer splits at characters like spaces.
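To see the difference, a quick sketch (assuming RWeka is attached) of what NGramTokenizer returns, namely word-level rather than character-level n-grams:

library(RWeka)

# Splits "In this paper" into word tokens first, then builds bigrams over them,
# yielding grams like "In this" and "this paper" rather than character 4-grams
NGramTokenizer("In this paper", Weka_control(min = 2, max = 2))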
##########
### the following lines are mainly a one-to-one copy from RWeka.
### Only the hardcoded CharacterNGramTokenizer is new
library(rJava)
library(RWeka)   # attaching RWeka puts the Weka classes on the Java classpath

CharacterNGramTokenizer <- structure(function(x, control = NULL) {
    # Hardcode Weka's character-level tokenizer instead of NGramTokenizer
    tokenizer <- .jnew("weka/core/tokenizers/CharacterNGramTokenizer")
    x <- Filter(nzchar, as.character(x))
    if (!length(x))
        return(character())
    .jcall("RWekaInterfaces", "[S", "tokenize",
           .jcast(tokenizer, "weka/core/tokenizers/Tokenizer"),
           .jarray(as.character(control)),
           .jarray(as.character(x)))
}, class = c("R_Weka_tokenizer_interface", "R_Weka_interface"),
meta = structure(list(name = "weka/core/tokenizers/CharacterNGramTokenizer",
    kind = "R_Weka_tokenizer_interface", class = "character",
    init = NULL), .Names = c("name", "kind", "class", "init")))
### copy till here
###################
BigramTokenizer <- function(x) {
    CharacterNGramTokenizer(x, Weka_control(min = 2, max = 2))
}
Unfortunately, it is not included in RWeka by default. But if you want to stay with Weka, this seems like a workable way to do it.
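A minimal usage sketch, assuming the tm corpus from the question is called corpus and that CharacterNGramTokenizer accepts the same -min/-max options as NGramTokenizer (the tokenizer name and tdm variable below are illustrative):

# Character 4-grams, matching the example from the question
CharQuadgramTokenizer <- function(x) {
    CharacterNGramTokenizer(x, Weka_control(min = 4, max = 4))
}
tdm_char4 <- TermDocumentMatrix(corpus,
    control = list(tokenize = CharQuadgramTokenizer, wordLengths = c(2, Inf)))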
I find quanteda quite useful:
library(tm)
library(quanteda)   # written against the older quanteda API; newer versions replace tokenize() with tokens()

txts <- c("In this paper.", "In this lines this.")
# Replace whitespace with "_" so spaces remain visible inside the grams,
# then split into character 4-grams concatenated without a separator
tokens <- tokenize(gsub("\\s", "_", txts), "character", ngrams = 4L, conc = "")
dfm <- dfm(tokens)
tdm <- as.TermDocumentMatrix(t(dfm), weighting = weightTf)
as.matrix(tdm)
# Docs
# Terms text1 text2
# In_t 1 1
# n_th 1 1
# _thi 1 2
# this 1 2
# his_ 1 1
# is_p 1 0
# s_pa 1 0
# _pap 1 0
# pape 1 0
# aper 1 0
# per. 1 0
# is_l 0 1
# s_li 0 1
# _lin 0 1
# line 0 1
# ines 0 1
# nes_ 0 1
# es_t 0 1
# s_th 0 1
# his. 0 1
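As a quick sanity check of those grams, a plain base-R sketch (the char_ngrams helper is illustrative, not from any package) that produces the same character 4-grams for the first text:

# All character n-grams of one string, with whitespace replaced by "_" as above
char_ngrams <- function(txt, n = 4L) {
    txt <- gsub("\\s", "_", txt)
    starts <- seq_len(max(nchar(txt) - n + 1L, 0L))
    substring(txt, starts, starts + n - 1L)
}
char_ngrams("In this paper.")
# "In_t" "n_th" "_thi" "this" "his_" "is_p" "s_pa" "_pap" "pape" "aper" "per."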