Apply function to textreuse corpus

I have a data frame like this:

df <- data.frame(revtext = c('the dog that chased the cat',
                             'the dog which chased the cat',
                             'World Cup Hair 2014 very funny.i can change',
                             'BowBow',
                             'this is'),
                 rid = c('r01', 'r02', 'r03', 'r04', 'r05'),
                 stringsAsFactors = FALSE)

                                    revtext rid
                the dog that chased the cat r01
               the dog which chased the cat r02
World Cup Hair 2014 very funny.i can change r03
                                     BowBow r04
                                    this is r05

I am using the package textreuse. To convert df into a corpus I do:

# install.packages("textreuse")
library(textreuse)
d <- df$revtext
names(d) <- df$rid
corpus <- TextReuseCorpus(text = d,
                          tokenizer = tokenize_character, k = 3,
                          progress = FALSE,
                          keep_tokens = TRUE)

where tokenize_character is a function I wrote:

tokenize_character <- function(document, k) {
  shingles <- character(nchar(document) - k + 1)
  for (i in seq_along(shingles)) {
    shingles[i] <- substr(document, start = i, stop = i + k - 1)
  }
  unique(shingles)
}
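As a quick sanity check, the shingler can be run directly on the short strings. This is a standalone sketch, so it repeats the function definition from above:

```r
# Same character-shingling logic as above, repeated so this snippet is self-contained
tokenize_character <- function(document, k) {
  shingles <- character(nchar(document) - k + 1)
  for (i in seq_along(shingles)) {
    shingles[i] <- substr(document, start = i, stop = i + k - 1)
  }
  unique(shingles)
}

tokenize_character("BowBow", 3)   # "Bow" "owB" "wBo"
tokenize_character("this is", 3)  # "thi" "his" "is " "s i" " is"
```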

However, I get a warning: Skipping document with ID 'r04' because it has too few words to create at least two n-grams with n = 3. But note that my tokenizer works at the character level, and the text of r04 is long enough: in fact, if we run tokenize_character('BowBow', 3) we get "Bow" "owB" "wBo".

Also note that for r01, TextReuseCorpus works as expected, returning: tokens(corpus)$r01 = "the" "he " "e d" " do" "dog" "og " "g t" " th" "tha" "hat" "at " "t c" " ch" "cha" "has" "ase" "sed" "ed " "d t" "e c" " ca" "cat"

Any suggestions? I don't know what I'm missing here.

From the Details section of the textreuse::TextReuseCorpus documentation:

If skip_short = TRUE, this function will skip very short or empty documents. A very short document is one where there are too few words to create at least two n-grams. For example, if five-grams are desired, then a document must be at least six words long. If no value of n is provided, then the function assumes a value of n = 3.
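The word-count threshold the documentation describes can be illustrated in base R. This is only a sketch: splitting on whitespace here is an assumption for illustration, not necessarily how the package counts words internally:

```r
# A document needs at least n + 1 words to yield two n-grams; n defaults to 3
n <- 3
docs <- c(r01 = "the dog that chased the cat",
          r04 = "BowBow",
          r05 = "this is")
word_counts <- lengths(strsplit(docs, "\\s+"))  # words per document
word_counts           # r01: 6, r04: 1, r05: 2
word_counts >= n + 1  # r01: TRUE, r04: FALSE, r05: FALSE
```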

Based on this, we know that documents with fewer than 4 words will be skipped as short documents (with n = 3, as in your example), which is what happens with r04 and r05, which have 1 and 2 words respectively. To keep these documents, you can use skip_short = FALSE, which will return the expected output:

corpus <- TextReuseCorpus(text = d, tokenizer = tokenize_character, k = 3,
                          skip_short = FALSE, progress = FALSE, keep_tokens = TRUE)
tokens(corpus)$r04
[1] "Bow" "owB" "wBo"