Apply function to textreuse corpus
I have a data frame as follows:
df <- data.frame(
  revtext = c('the dog that chased the cat',
              'the dog which chased the cat',
              'World Cup Hair 2014 very funny.i can change',
              'BowBow',
              'this is'),
  rid = c('r01', 'r02', 'r03', 'r04', 'r05'),
  stringsAsFactors = FALSE
)
revtext                                      rid
the dog that chased the cat                  r01
the dog which chased the cat                 r02
World Cup Hair 2014 very funny.i can change  r03
BowBow                                       r04
this is                                      r05
I am using the textreuse package to convert df into a corpus like this:
# install.packages("textreuse")
library(textreuse)
d <- df$revtext
names(d) <- df$rid
corpus <- TextReuseCorpus(text = d,
                          tokenizer = tokenize_character, k = 3,
                          progress = FALSE,
                          keep_tokens = TRUE)
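where tokenize_character is a function I wrote:
tokenize_character <- function(document, k) {
  shingles <- c()
  # Slide a window of k characters across the text.
  # Assumes nchar(document) >= k; otherwise 1:(...) counts down instead of being empty.
  for (i in 1:(nchar(document) - k + 1)) {
    shingles[i] <- substr(document, start = i, stop = i + k - 1)
  }
  return(unique(shingles))
}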
However, I get some warnings: Skipping document with ID 'r04' because it has too few words to create at least two n-grams with n = 3.
But note that my tokenizer works at the character level, and the text of r04 is long enough. In fact, if we run tokenize_character('BowBow', 3) we get: "Bow" "owB" "wBo".
Also note that for r01, TextReuseCorpus works as expected, returning: tokens(corpus)$r01 = "the" "he " "e d" " do" "dog" "og " "g t" " th" "tha" "hat" "at " "t c" " ch" "cha" "has" "ase" "sed" "ed " "d t" "e c" " ca" "cat"
Any suggestions? I don't know what I'm missing here.
From the Details section of the textreuse::TextReuseCorpus documentation:
If skip_short = TRUE, this function will skip very short or empty
documents. A very short document is one where there are two few words
to create at least two n-grams. For example, if five-grams are
desired, then a document must be at least six words long. If no value of n is
provided, then the function assumes a value of n = 3.
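From this, we know that documents with fewer than 4 words are skipped as too short (n = 3 in your example). The check is on the word count, not on the number of tokens your character-level tokenizer would produce, which is why r04 and r05 are skipped: they have 1 and 2 words respectively.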
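As a rough illustration of that rule, you can count the words per document in base R and compare against n + 1. This is only a sketch: it assumes words are separated by whitespace, which may not match the package's internal count in every edge case, and count_words is a helper made up for this example, not a textreuse function.

```r
# Documents from the question, keyed by their ids
docs <- c(r01 = "the dog that chased the cat",
          r02 = "the dog which chased the cat",
          r03 = "World Cup Hair 2014 very funny.i can change",
          r04 = "BowBow",
          r05 = "this is")

n <- 3  # n-gram size the length check assumes

# Hypothetical helper: number of whitespace-separated words per string
count_words <- function(x) lengths(strsplit(trimws(x), "\\s+"))

word_counts <- count_words(docs)

# Two n-grams need at least n + 1 words, so anything below that is "short"
too_short <- word_counts < n + 1

print(data.frame(id = names(docs), words = unname(word_counts),
                 short = unname(too_short)))
print(names(docs)[too_short])
```

Under that assumption, only r04 and r05 fall below the four-word threshold, which matches the documents that get skipped.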
To keep these documents, use skip_short = FALSE, which returns the expected output:
corpus <- TextReuseCorpus(text = d, tokenizer = tokenize_character, k = 3,
                          skip_short = FALSE, progress = FALSE, keep_tokens = TRUE)
tokens(corpus)$r04
[1] "Bow" "owB" "wBo"