Generating all word unigrams through trigrams in R
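I am trying to generate a list of all unigrams through trigrams in R, in order to eventually build a document-phrase matrix whose columns include all single words, bigrams, and trigrams.
I expected to find a simple package for this, but had no luck. I did eventually get pointed to RWeka (code and output below), but unfortunately this approach drops all unigrams of 1 or 2 characters.
Can this approach be fixed, or does anyone know of another way? Thanks!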
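library(tm)     # Corpus, VectorSource, TermDocumentMatrix, inspect
library(RWeka)  # NGramTokenizer, Weka_control

# Tokenizer that emits every n-gram from width 1 to width 3
TrigramTokenizer <- function(x) NGramTokenizer(x,
                                               Weka_control(min = 1, max = 3))
Text = c( "Ab Hello world","Hello ab", "ab" )
tt = Corpus(VectorSource(Text))
tdm <- TermDocumentMatrix( tt,
                           control = list(tokenize = TrigramTokenizer))
inspect(tdm)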
# <<TermDocumentMatrix (terms: 6, documents: 3)>>
# Non-/sparse entries: 7/11
# Sparsity : 61%
# Maximal term length: 14
# Weighting : term frequency (tf)
# Docs
# Terms 1 2 3
# ab hello 1 0 0
# ab hello world 1 0 0
# hello 1 1 0
# hello ab 0 1 0
# hello world 1 0 0
# world 1 0 0
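Here is a version of the ngram() function below, edited for optimization (I think). Basically, I tried to reuse the token strings to get rid of the double loop when include.all=TRUE.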
ngram <- function(tokens, n = 2, concatenator = "_", include.all = FALSE) {
  M = length(tokens)
  stopifnot( n > 0 )

  # if include.all=FALSE return null if nothing to report due to short doc
  if ( ( M == 0 ) || ( !include.all && M < n ) ) {
    return( c() )
  }

  # bail if just want original tokens or if we only have one token
  if ( (n == 1) || (M == 1) ) {
    return( tokens )
  }

  # set max size of ngram at max length of tokens
  end <- min( M-1, n-1 )

  all_ngrams <- c()
  toks = tokens
  for (width in 1:end) {
    if ( include.all ) {
      all_ngrams <- c( all_ngrams, toks )
    }
    toks = paste( toks[1:(M-width)], tokens[(1+width):M], sep=concatenator )
  }
  all_ngrams <- c( all_ngrams, toks )
  all_ngrams
}
ngram( c("A","B","C","D"), n=3, include.all=TRUE )
ngram( c("A","B","C","D"), n=3, include.all=FALSE )
ngram( c("A","B","C","D"), n=10, include.all=FALSE )
ngram( c("A","B","C","D"), n=10, include.all=TRUE )
# edge cases
ngram( c(), n=3, include.all=TRUE )
ngram( "A", n=0, include.all=TRUE )
ngram( "A", n=3, include.all=TRUE )
ngram( "A", n=3, include.all=FALSE )
ngram( "A", n=1, include.all=FALSE )
ngram( "A", n=1, include.all=TRUE )
ngram( c("A","B"), n=1, include.all=FALSE )
ngram( c("A","B"), n=1, include.all=TRUE )
ngram( c("A","B","C"), n=1, include.all=FALSE )
ngram( c("A","B","C"), n=1, include.all=TRUE )
You're in luck, there is a package for this: quanteda.
# or: devtools::install_github("kbenoit/quanteda")
require(quanteda)
Text <- c("Ab Hello world", "Hello ab", "ab")
dfm(Text, ngrams = 1:3, verbose = FALSE)
## Document-feature matrix of: 3 documents, 7 features.
## 3 x 7 sparse Matrix of class "dfmSparse"
## features
## docs ab ab_hello ab_hello_world hello hello_ab hello_world world
## text1 1 1 1 1 0 1 1
## text2 1 0 0 1 1 0 0
## text3 1 0 0 0 0 0 0
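This creates a document-feature matrix in which the "features" are the lowercased unigrams, bigrams, and trigrams. If you prefer spaces between the words, just add the argument concatenator = " " to the dfm() call.
Problem solved, no Weka required.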
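The `ngrams` argument to `dfm()` used above reflects the quanteda version that was current when this was written. To the best of my knowledge, later releases split this into separate steps (`tokens()`, then `tokens_ngrams()`, then `dfm()`), so the following is a sketch under that assumption rather than a verified drop-in replacement. The `requireNamespace()` guard and the variable name `dfm_1to3` are only there for illustration.

```r
Text <- c("Ab Hello world", "Hello ab", "ab")

# Only run if quanteda is installed; the guard is for illustration.
if (requireNamespace("quanteda", quietly = TRUE)) {
  # Step 1: split each document into word tokens.
  toks <- quanteda::tokens(Text)

  # Step 2: build all n-grams of width 1 to 3, joined with "_".
  toks <- quanteda::tokens_ngrams(toks, n = 1:3, concatenator = "_")

  # Step 3: count features per document (dfm() lowercases by default).
  dfm_1to3 <- quanteda::dfm(toks)
  print(dfm_1to3)
}
```

Passing `concatenator = " "` to `tokens_ngrams()` should give space-separated features instead.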
Out of curiosity, here is the workhorse function that creates the n-grams, where tokens is a character vector (coming from a separate tokenizer):
ngram <- function(tokens, n = 2, concatenator = "_", include.all = FALSE) {

  # start with lower ngrams, or just the specified size if include.all = FALSE
  start <- ifelse(include.all,
                  1,
                  ifelse(length(tokens) < n, 1, n))

  # set max size of ngram at max length of tokens
  end <- ifelse(length(tokens) < n, length(tokens), n)

  all_ngrams <- c()
  # outer loop for all ngrams down to 1
  for (width in start:end) {
    new_ngrams <- tokens[1:(length(tokens) - width + 1)]
    # inner loop for ngrams of width > 1
    if (width > 1) {
      for (i in 1:(width - 1))
        new_ngrams <- paste(new_ngrams,
                            tokens[(i + 1):(length(tokens) - width + 1 + i)],
                            sep = concatenator)
    }
    # paste onto previous results and continue
    all_ngrams <- c(all_ngrams, new_ngrams)
  }

  all_ngrams
}
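To make the "tokens from a separate tokenizer" step concrete, here is a minimal base-R sketch that goes from raw text to a document-by-n-gram count matrix without any packages. The helper `ngrams_upto()` and the naive lowercase-and-split tokenizer are assumptions made for this illustration; they are not quanteda's implementation.

```r
# Hypothetical helper: all n-grams of width 1..n from a character vector of tokens.
ngrams_upto <- function(tokens, n = 3, concatenator = "_") {
  out <- character(0)
  for (width in seq_len(min(n, length(tokens)))) {
    starts <- seq_len(length(tokens) - width + 1)
    grams <- vapply(starts,
                    function(i) paste(tokens[i:(i + width - 1)], collapse = concatenator),
                    character(1))
    out <- c(out, grams)
  }
  out
}

Text <- c("Ab Hello world", "Hello ab", "ab")

# Assumption: a naive tokenizer that lowercases and splits on whitespace.
tokens_by_doc <- strsplit(tolower(Text), "\\s+")
grams_by_doc  <- lapply(tokens_by_doc, ngrams_upto, n = 3)

# Count each n-gram per document against a shared feature list,
# so that every document ends up with the same columns.
features <- sort(unique(unlist(grams_by_doc)))
dfm_base <- t(vapply(grams_by_doc,
                     function(g) as.integer(table(factor(g, levels = features))),
                     integer(length(features))))
dimnames(dfm_base) <- list(docs = paste0("text", seq_along(Text)),
                           features = features)
dfm_base
```

The result is a dense matrix, which is fine for a toy example; for real corpora a sparse representation such as the one quanteda or tm builds is the better choice.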
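Oops. It turns out you can pass some options through control to do this. The termFreq method gets called, and you can pass options to it, such as the tokenizer to use (as above) and which cleaning steps to perform.
So this tweak works: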
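library(tm)     # Corpus, VectorSource, TermDocumentMatrix, inspect
library(RWeka)  # NGramTokenizer, Weka_control

TrigramTokenizer <- function(x) NGramTokenizer(x,
                                               Weka_control(min = 1, max = 3))
Text = c( "Ab Hello world","Hello ab", "ab" )
tt = Corpus(VectorSource(Text))
# wordLengths = c(1, Inf) keeps terms of any length, including "ab"
tdm <- TermDocumentMatrix( tt,
                           control = list(wordLengths=c(1,Inf), tokenize = TrigramTokenizer))
inspect(tdm)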
which gives:
<<TermDocumentMatrix (terms: 7, documents: 3)>>
Non-/sparse entries: 10/11
Sparsity : 52%
Maximal term length: 14
Weighting : term frequency (tf)
Docs
Terms 1 2 3
ab 1 1 1
ab hello 1 0 0
ab hello world 1 0 0
hello 1 1 0
hello ab 0 1 0
hello world 1 0 0
world 1 0 0
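The reason the original call lost "ab" is, as far as I understand tm's documentation, that wordLengths defaults to c(3, Inf), so any term shorter than three characters is discarded. The base-R sketch below only imitates that length filter on the seven terms from the output above; it is an illustration of the effect, not tm's actual code, and the variable names are made up for the example.

```r
# Terms produced by the 1- to 3-gram tokenizer for the three example documents.
terms <- c("ab", "ab hello", "ab hello world",
           "hello", "hello ab", "hello world", "world")

# Assumed tm default, wordLengths = c(3, Inf): keep terms with 3 or more characters.
keep_default <- terms[nchar(terms) >= 3]

# With wordLengths = c(1, Inf): keep terms of any non-zero length.
keep_all <- terms[nchar(terms) >= 1]

# The only term lost under the default filter.
setdiff(keep_all, keep_default)
```

Bigrams and trigrams containing "ab" survive the default filter because the length check applies to the whole term, which is why only the unigram disappeared.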