How to get all possible two-word combinations with their frequencies without the tm package
I have a text like this:
dat<-c("this is my farm this is my land")
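I want to get all possible two-word combinations together with their frequencies.
I can't use the tm package, so any other solution would be appreciated.
The output should look like this: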
two words freq
this is 2
is my 2
my farm 1
my land 1
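You can generate the combinations by splitting dat into words and then pasting each pair of consecutive words together. gregexpr can then be used to count how often each combination occurs.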
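# Split the text into individual words
temp = unlist(strsplit(dat, " "))
# Build every consecutive two-word combination, keeping each one once
temp2 = unique(sapply(2:length(temp), function(i)
    paste(temp[(i-1):i], collapse = " ")))
# Count the occurrences of each combination (fixed = TRUE matches literally, not as a regex)
sapply(temp2, function(x)
    length(unlist(gregexpr(pattern = x, text = dat, fixed = TRUE))))
#   this is     is my   my farm farm this   my land
#         2         2         1         1         1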
Or, for three-word combinations:
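# Same approach, with a window of three consecutive words
temp = unlist(strsplit(dat, " "))
temp2 = unique(sapply(3:length(temp), function(i)
    paste(temp[(i-2):i], collapse = " ")))
# Count the occurrences of each combination (fixed = TRUE matches literally, not as a regex)
sapply(temp2, function(x)
    length(unlist(gregexpr(pattern = x, text = dat, fixed = TRUE))))
#   this is my   is my farm my farm this farm this is   is my land
#            2            1            1            1            1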
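One caveat with the gregexpr approach is that it counts substring matches, so a combination such as "is my" would also be counted inside "this my". A possible alternative, sketched below, is to build the combinations once and tabulate them directly with base R's table(). The helper name count_ngrams and its n argument are made up for this illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): count word n-grams
# by tabulating them, so only whole-word sequences are counted.
count_ngrams <- function(text, n = 2) {
  # Split on single spaces, as in the answer above
  words <- unlist(strsplit(text, " ", fixed = TRUE))
  # Fewer words than n means there are no n-grams to count
  if (length(words) < n) return(table(character(0)))
  # Start index of every window of n consecutive words
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(starts, function(i)
    paste(words[i:(i + n - 1)], collapse = " "), character(1))
  # table() counts how many times each n-gram appears
  table(grams)
}

dat <- c("this is my farm this is my land")
bigrams <- count_ngrams(dat, 2)
trigrams <- count_ngrams(dat, 3)

# Convert to a two-column data frame (n-gram, frequency)
as.data.frame(bigrams, stringsAsFactors = FALSE)
```

Note that table() orders the combinations alphabetically rather than by first appearance, so the row order differs from the sapply output above.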