Generating all word unigrams through trigrams in R

Question

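I am trying to generate a list of all unigrams through trigrams in R, so that I can eventually build a document-phrase matrix whose columns include all single words, bigrams, and trigrams.

I hoped to find a simple package for this, but had no luck. I did end up being pointed to RWeka (code and output below), but unfortunately this approach drops all unigrams that are only 1 or 2 characters long.

Can this approach be fixed, or does anyone know of another way? Thanks!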

library(tm)     # Corpus(), VectorSource(), TermDocumentMatrix(), inspect()
library(RWeka)  # NGramTokenizer(), Weka_control()

TrigramTokenizer <- function(x) NGramTokenizer(x, 
                                               Weka_control(min = 1, max = 3))
Text = c( "Ab Hello world","Hello ab",  "ab" )
tt = Corpus(VectorSource(Text))
tdm <- TermDocumentMatrix( tt, 
                           control = list(tokenize = TrigramTokenizer))
inspect(tdm)
# <<TermDocumentMatrix (terms: 6, documents: 3)>>
# Non-/sparse entries: 7/11
# Sparsity           : 61%
# Maximal term length: 14
# Weighting          : term frequency (tf)

#                 Docs
# Terms            1 2 3
#   ab hello       1 0 0
#   ab hello world 1 0 0
#   hello          1 1 0
#   hello ab       0 1 0
#   hello world    1 0 0
#   world          1 0 0

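Here is a version of the ngram() function from the answer below, edited for efficiency (I think). Basically, I tried to reuse the token strings to get out of the double loop when include.all=TRUE.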

ngram <- function(tokens, n = 2, concatenator = "_", include.all = FALSE) {
    M = length(tokens)

    stopifnot( n > 0 )

    # if include.all=FALSE return null if nothing to report due to short doc
    if ( ( M == 0 ) || ( !include.all && M < n ) ) {
        return( c() )
    }

    # bail if just want original tokens or if we only have one token
    if ( (n == 1) || (M == 1) ) {
        return( tokens )
    }

    # set max size of ngram at max length of tokens
    end <- min( M-1, n-1 )

    all_ngrams <- c()
    toks = tokens
    for (width in 1:end) {
        if ( include.all ) {
            all_ngrams <- c( all_ngrams, toks )
        }
        toks = paste( toks[1:(M-width)], tokens[(1+width):M], sep=concatenator )
    }
    all_ngrams <- c( all_ngrams, toks )

    all_ngrams
}

ngram( c("A","B","C","D"), n=3, include.all=TRUE ) 
ngram( c("A","B","C","D"), n=3, include.all=FALSE ) 

ngram( c("A","B","C","D"), n=10, include.all=FALSE ) 
ngram( c("A","B","C","D"), n=10, include.all=TRUE ) 


# edge cases
ngram( c(), n=3, include.all=TRUE ) 
try( ngram( "A", n=0, include.all=TRUE ) )  # stopifnot( n > 0 ) raises an error here
ngram( "A", n=3, include.all=TRUE ) 
ngram( "A", n=3, include.all=FALSE ) 
ngram( "A", n=1, include.all=FALSE ) 
ngram( "A", n=1, include.all=TRUE ) 
ngram( c("A","B"), n=1, include.all=FALSE ) 
ngram( c("A","B"), n=1, include.all=TRUE ) 
ngram( c("A","B","C"), n=1, include.all=FALSE ) 
ngram( c("A","B","C"), n=1, include.all=TRUE )

Answer 1

You're in luck, there is a package for this: quanteda.

# or: devtools::install_github("kbenoit/quanteda")
require(quanteda)

Text <- c("Ab Hello world", "Hello ab", "ab")

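# Note: this is the dfm() interface from when this answer was written; newer
# quanteda releases build n-grams with tokens() and tokens_ngrams(n = 1:3)
# before calling dfm().
dfm(Text, ngrams = 1:3, verbose = FALSE)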
## Document-feature matrix of: 3 documents, 7 features.
## 3 x 7 sparse Matrix of class "dfmSparse"
## features
## docs    ab ab_hello ab_hello_world hello hello_ab hello_world world
## text1  1        1              1     1        0           1     1
## text2  1        0              0     1        1           0     0
## text3  1        0              0     0        0           0     0

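This creates a document-feature matrix in which the "features" are the lowercased unigrams, bigrams, and trigrams. If you prefer spaces between the words, just add the argument concatenator = " " to the dfm() call.

Problem solved, no Weka required.

For the curious, here is the workhorse function that creates the n-grams, where tokens is a character vector (from a separate tokenizer):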

ngram <- function(tokens, n = 2, concatenator = "_", include.all = FALSE) {

    # start with lower ngrams, or just the specified size if include.all = FALSE
    start <- ifelse(include.all, 
                    1, 
                    ifelse(length(tokens) < n, 1, n))

    # set max size of ngram at max length of tokens
    end <- ifelse(length(tokens) < n, length(tokens), n)

    all_ngrams <- c()
    # outer loop for all ngrams down to 1
    for (width in start:end) {
        new_ngrams <- tokens[1:(length(tokens) - width + 1)]
        # inner loop for ngrams of width > 1
        if (width > 1) {
            for (i in 1:(width - 1)) 
                new_ngrams <- paste(new_ngrams, 
                                    tokens[(i + 1):(length(tokens) - width + 1 + i)], 
                                    sep = concatenator)
        }
        # paste onto previous results and continue
        all_ngrams <- c(all_ngrams, new_ngrams)
    }

    all_ngrams
}
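
To connect an n-gram function like the one above with the matrix the question asks for, here is a minimal base-R sketch. It is not how quanteda or tm do it internally: ngrams_upto is a hypothetical helper written only for this illustration, and lowercasing plus splitting on whitespace is an assumed, deliberately naive tokenizer.

```r
# Hypothetical helper (not from tm, RWeka, or quanteda):
# returns all n-grams of width 1..n for a character vector of tokens.
ngrams_upto <- function(tokens, n = 3, concatenator = " ") {
  M <- length(tokens)
  if (M == 0) return(character(0))
  unlist(lapply(seq_len(min(n, M)), function(width) {
    starts <- seq_len(M - width + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + width - 1)], collapse = concatenator),
           character(1))
  }))
}

Text <- c("Ab Hello world", "Hello ab", "ab")

# Assumption: lowercase everything and split on runs of whitespace.
token_list <- strsplit(tolower(Text), "\\s+")
gram_list  <- lapply(token_list, ngrams_upto, n = 3)

# Tabulate the n-grams of each document into a terms-by-documents count matrix.
terms  <- sort(unique(unlist(gram_list)))
counts <- do.call(cbind, lapply(gram_list, function(g) {
  as.integer(table(factor(g, levels = terms)))
}))
dimnames(counts) <- list(Terms = terms, Docs = seq_along(Text))
counts
```

The result is a dense matrix, which is fine for a toy corpus; for real data a sparse representation such as the ones tm and quanteda produce is the more sensible choice.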

Answer 2

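Oops. It turns out you can pass some options to control to do this. The termFreq method gets called, and you can pass it options such as the tokenizer to use (as above) and the cleanup to perform. One of those options is wordLengths, which defaults to c(3, Inf); that default is why terms shorter than three characters were being dropped.

So this tweak works: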

TrigramTokenizer <- function(x) NGramTokenizer(x, 
                                                Weka_control(min = 1, max = 3))
Text = c( "Ab Hello world","Hello ab",  "ab" )
tt = Corpus(VectorSource(Text))
tdm <- TermDocumentMatrix( tt, 
                            control = list(wordLengths=c(1,Inf), tokenize = TrigramTokenizer))
inspect(tdm)

which gives

<<TermDocumentMatrix (terms: 7, documents: 3)>>
Non-/sparse entries: 10/11
Sparsity           : 52%
Maximal term length: 14
Weighting          : term frequency (tf)

                Docs
Terms            1 2 3
  ab             1 1 1
  ab hello       1 0 0
  ab hello world 1 0 0
  hello          1 1 0
  hello ab       0 1 0
  hello world    1 0 0
  world          1 0 0
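
The effect of the wordLengths bounds can be illustrated without tm at all. The sketch below is only a stand-in for the idea of a length filter: keep_by_length is a hypothetical function, not tm's actual implementation, and the term list is typed in by hand from the example text.

```r
# Terms a 1- to 3-gram tokenizer would yield for the lowercased example text.
terms <- c("ab", "hello", "world", "ab hello", "hello world", "ab hello world")

# Hypothetical stand-in for a term-length filter; not tm's own code.
# Keeps terms whose character count lies within wordLengths[1]..wordLengths[2].
keep_by_length <- function(terms, wordLengths = c(3, Inf)) {
  len <- nchar(terms)
  terms[len >= wordLengths[1] & len <= wordLengths[2]]
}

keep_by_length(terms)                           # minimum length 3: "ab" is dropped
keep_by_length(terms, wordLengths = c(1, Inf))  # minimum length 1: "ab" is kept
```

Note that the bounds apply to the whole term string, so "ab hello" survives a minimum of 3 even though it contains the short word "ab", which matches the output in the question.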

Tags: text-processing, r, tm, rweka, quanteda