Get term frequencies within categories in R dictionary

Question

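I have a dictionary with several subcategories, and I would like to use R to find the most frequent words and bigrams within each subcategory.

I am working with a large dataset, but here is a small example: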

s <-  "Day after day, day after day,
We stuck, nor breath nor motion;"

library(stringi)
x <- stri_replace_all(s, "", regex="<.*?>")
x <- stri_trim(x)
x <- stri_trans_tolower(x)

library(quanteda)
toks <- tokens(x) 
toks <- tokens_wordstem(toks) 

dtm <- dfm(toks, 
       tolower=TRUE, stem=TRUE,
       remove=stopwords("english"))

dict1 <- dictionary(list(a=c("day*", "week*", "month*"),
                    b=c("breath*","motion*")))

dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")                                 
tail(dict_dtm2)

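This gives me the total frequency for each subcategory, but not the frequency of each individual word within those subcategories. The result I am looking for would look something like this: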

words(a)   freq
day         4
week        0
month       0

words(b)   freq
breath     1
motion     1

Any help would be greatly appreciated!

Answer 1

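As I understand your question, I believe you are looking for the table() command. You will need some regular expressions to clean up the first sentence, but I am sure you can manage that. One idea is as follows: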

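s <- "day after day day after day We stuck nor breath nor motion"
# Split on whitespace; the backslash must be doubled inside an R string
s <- unlist(strsplit(s, "\\s+"))

# Use `=` inside list() so the elements are named a and b
dict <- list(a = c("day", "week", "month"),
             b = c("breath", "motion"))
lapply(dict, function(x) {
  # Keep only the dictionary words that actually occur in the text
  Wordsinvect <- intersect(x, s)
  table(s)[Wordsinvect]
})


# $a
# day 
# 4 
# 
# $b
# s
# breath motion 
# 1      1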
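
Note that the intersect() approach drops dictionary words that never occur, whereas the desired output in the question lists week and month with a frequency of 0, and the question's dictionary uses glob patterns such as "day*". A minimal base R sketch that covers both points might look like the following. The helper name `count_terms` is hypothetical, and the tokenization (lowercasing, stripping punctuation, splitting on whitespace) is an assumption for this small example rather than what quanteda does internally.

```r
# Example text from the question
s <- "Day after day, day after day,
We stuck, nor breath nor motion;"

# Assumed preprocessing: lowercase, drop punctuation, split on whitespace
words <- strsplit(gsub("[[:punct:]]", "", tolower(s)), "\\s+")[[1]]

# Dictionary with glob patterns, as in the question
dict <- list(a = c("day*", "week*", "month*"),
             b = c("breath*", "motion*"))

# Hypothetical helper: for each glob pattern, count the tokens it matches.
# glob2rx() (utils package) converts a glob such as "day*" to a regex.
count_terms <- function(patterns, tokens) {
  vapply(patterns, function(p) {
    sum(grepl(glob2rx(p), tokens))
  }, integer(1))
}

# One named integer vector per category; unmatched patterns get 0
freq <- lapply(dict, count_terms, tokens = words)
freq
```

Each count here is per pattern, so "day" and "days" would be pooled under "day*". If separate counts per surface form are needed, tabulating the matched tokens themselves would be the natural variation.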

Hope this helps. Cheers!
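
The question also asks about bigrams, which the answer above does not cover. The sketch below is one possible base R approach, under an assumption the question does not spell out: a bigram is assigned to a category if either of its two words matches one of that category's glob patterns. The helper name `matches_any` is hypothetical, and the bigrams are built from the whole text, so pairs can span the line break.

```r
# Example text from the question
s <- "Day after day, day after day,
We stuck, nor breath nor motion;"

# Assumed preprocessing: lowercase, drop punctuation, split on whitespace
words <- strsplit(gsub("[[:punct:]]", "", tolower(s)), "\\s+")[[1]]

# Build adjacent word pairs
first   <- head(words, -1)
second  <- tail(words, -1)
bigrams <- paste(first, second)

dict <- list(a = c("day*", "week*", "month*"),
             b = c("breath*", "motion*"))

# Hypothetical helper: TRUE for tokens matching any of the glob patterns
matches_any <- function(tokens, patterns) {
  rx <- paste(glob2rx(patterns), collapse = "|")
  grepl(rx, tokens)
}

# Per category: keep bigrams where either word matches, then tabulate
bigram_freq <- lapply(dict, function(patterns) {
  keep <- matches_any(first, patterns) | matches_any(second, patterns)
  sort(table(bigrams[keep]), decreasing = TRUE)
})
bigram_freq
```

For a large corpus, quanteda's own tokenization and n-gram tools would likely scale better; this version is only meant to make the per-category logic explicit.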

r

text-analysis

text-mining

quanteda