Maximum occurrence of any set of words in text in R
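Given a set of lines, I have to find the maximum occurrence of words (it need not be a single word; it can also be a set of words).
Say I have a text like this: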
string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
I want the output to be:
john beck - 3
chemical engineer - 2
Is there any function or package that does this?
Try this:
string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
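library(tau)  # textcnt()
library(tm)   # MC_tokenizer(), stemDocument(), stemCompletion(); stemming relies on the SnowballC package
## tokenize, then drop the empty tokens the tokenizer leaves behind
tokens <- MC_tokenizer(string)
tokens <- tokens[tokens != ""]
## stem every token, then complete each stem back to a full word, using the original tokens as the dictionary
string_ <- paste(stemCompletion(stemDocument(tokens), tokens), collapse = " ")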
## if you want only bi-grams:
tab <- sort(textcnt(string_, method = "string", n = 2), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
#                   Freq
# john beck            3
# chemical engineer    2
## if you want uni-, bi- and tri-grams:
nmin <- 1; nmax <- 3
tab <- sort(do.call(c, lapply(nmin:nmax, function(x) textcnt(string_, method = "string", n = x) )), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
#                   Freq
# beck                 3
# john                 3
# john beck            3
# chemical             2
# engineer             2
# is                   2
# chemical engineer    2
You can also try this, using the quanteda package:
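require(quanteda)
## note: ngrams= and stem= are arguments of an older dfm() interface; newer quanteda
## releases may need tokens(), tokens_wordstem() and tokens_ngrams() before dfm()
mydfm <- dfm(string, ngrams = 1:2, concatenator = "_", stem = TRUE, verbose = FALSE)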
topfeatures(mydfm)
##           beck           john      john_beck         chemic chemical_engin          engin             is
##              3              3              3              2              2              2              2
##             an        an_chem            are
##              1              1              1
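You are left with the stems rather than the completed words, but this counts "john beck" three times instead of just twice (without the stemmer, "john beck's" would be a separate type).
Much simpler, though!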
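If you would rather not depend on any package, here is a minimal base-R sketch of the same idea: normalize the words, slide a window of n words over them, and tabulate. Note that count_ngrams() is a made-up helper name, not something from tau, tm or quanteda, and the plural stripping is a crude stand-in for real stemming, so treat it as an illustration under those assumptions rather than a replacement for the answers above.

```r
# count_ngrams() is a hypothetical helper (not part of tau, tm or quanteda).
# It returns the word n-grams that occur more than once, most frequent first.
count_ngrams <- function(text, n = 2) {
  # Lower-case and drop possessive 's so that "beck's" counts as "beck".
  text <- gsub("'s\\b", "", tolower(text), perl = TRUE)
  # Split on anything that is not a letter; sentence boundaries are ignored,
  # so an n-gram may span two sentences.
  words <- strsplit(text, "[^a-z]+")[[1]]
  words <- words[words != ""]
  # ASSUMPTION: crude plural stripping instead of a real stemmer. It removes a
  # trailing "s" only when at least three letters precede it ("is" is kept).
  words <- sub("(?<=[a-z]{3})s$", "", words, perl = TRUE)
  if (length(words) < n) return(integer(0))
  # Build every run of n consecutive words.
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(starts,
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  # Tabulate, convert to a plain named integer vector, keep repeats only.
  tab <- table(grams)
  counts <- setNames(as.integer(tab), names(tab))
  sort(counts[counts > 1], decreasing = TRUE)
}

string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
res <- count_ngrams(string, n = 2)
print(res)
```

On the example string this should report "john beck" 3 times and "chemical engineer" 2 times. For real text you would probably want a proper stemmer, as in the package-based answers.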