How to get all possible 2-word combinations with their frequency without the tm package

Question

I have a text like this:

dat<-c("this is my farm this is my land")

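I want to get all possible 2-word combinations together with how often each occurs. I cannot use the tm package, so any other solution would be appreciated. The output should look like this: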

two words freq
this is     2
is my       2
my farm     1
my land     1

Answer 1

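You can generate the combinations by splitting dat into words and pasting each pair of consecutive words together. gregexpr can then be used to count how many times each combination occurs in the original string.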

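# Split the text into individual words
temp = unlist(strsplit(dat, " "))
# Build every pair of consecutive words and keep the distinct ones
temp2 = unique(sapply(2:length(temp), function(i)
    paste(temp[(i-1):i], collapse = " ")))
# Count the matches of each pair in the original string
sapply(temp2, function(x)
    length(unlist(gregexpr(pattern = x, text = dat))))
#  this is     is my   my farm farm this   my land 
#        2         2         1         1         1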
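
As an alternative sketch (not part of the original answer), the consecutive pairs can also be built with head() and tail() and counted with table(). This compares whole words instead of matching substrings, so a pair such as "is my" would not be counted inside "this my". The names words, bigrams, counts, two_words and freq are illustrative choices, not anything defined in the question.

```r
# Input text from the question
dat <- c("this is my farm this is my land")

# Split on runs of whitespace into individual words
words <- unlist(strsplit(dat, "\\s+"))

# Pair each word with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))

# Tabulate the pairs and turn the result into a two-column data frame
counts <- as.data.frame(table(bigrams), stringsAsFactors = FALSE)
names(counts) <- c("two_words", "freq")

# Show the most frequent pairs first
counts <- counts[order(-counts$freq), ]
print(counts)
```

Like the answer above, this also reports "farm this", because that pair does occur in the text even though it is missing from the expected output in the question.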

Or, for three-word combinations:

temp = unlist(strsplit(dat, " "))
temp2 = unique(sapply(3:length(temp), function(i)
    paste(temp[(i-2):i], collapse = " ")))
sapply(temp2, function(x)
    length(unlist(gregexpr(pattern = x, text = dat))))
#  this is my   is my farm my farm this farm this is   is my land 
#           2            1            1            1            1
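
Since the two snippets differ only in the window size, the idea can be wrapped in one function. The following is a minimal sketch under that assumption; ngram_counts and its parameters text and n are hypothetical names introduced here, and counting is done with table() on whole words rather than with gregexpr.

```r
# Hypothetical helper: count every run of n consecutive words in a string
ngram_counts <- function(text, n = 2) {
  words <- unlist(strsplit(text, "\\s+"))
  # Fewer than n words means there are no n-word combinations to count
  if (length(words) < n) return(table(character(0)))
  # Each n-gram starts at one of these positions
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(starts, function(i) {
    paste(words[i:(i + n - 1)], collapse = " ")
  }, character(1))
  table(grams)
}

dat <- c("this is my farm this is my land")
two_word <- ngram_counts(dat, 2)
three_word <- ngram_counts(dat, 3)
print(two_word)
print(three_word)
```

Calling ngram_counts(dat, 1) gives plain word counts with the same code path.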
