Matching a list of phrases to a corpus of documents and returning phrase frequency
I have a list of phrases and a corpus of documents. There are 100k+ phrases and 60k+ documents. The phrases may or may not be present in the corpus. I want to find the frequency of each phrase in the corpus.
Sample dataset:
Phrases <- c("just starting", "several kilometers", "brief stroll", "gradually boost", "5 miles", "dark night", "cold morning")
Doc1 <- "If you're just starting with workout, begin slow."
Doc2 <- "Don't jump in brain initial and then try to operate several kilometers without the need of worked out well before."
Doc3 <- "It is possible to end up injuring on your own and carrying out more damage than good."
Doc4 <- "Instead start with a brief stroll and gradually boost the duration along with the speed."
Doc5 <- "Before you know it you'll be working 5 miles without any problems."
I am new to text analysis in R and have approached this problem following Tyler Rinker's solution to R Text Mining: Counting the number of times a specific word appears in a corpus?
This is my approach so far:
library(tm)
library(qdap)

Docs <- c(Doc1, Doc2, Doc3, Doc4, Doc5)

# Clean the documents: drop English stopwords and punctuation, lower-case
text <- removeWords(Docs, stopwords("english"))
text <- removePunctuation(text)
text <- tolower(text)
corp <- Corpus(VectorSource(text))
Phrases <- tolower(Phrases)

# Count the phrases in each document, then write the result out as csv
word.freq <- apply_as_df(corp, termco_d, match.string = Phrases)
mcsv_w(word.freq, dir = NULL, open = TRUE, sep = ", ", dataframes = NULL,
       pos = 1, envir = as.environment(pos))
When I export the results as a csv, it only tells me whether a phrase is present (1) or absent (0) in any of the documents, not how often it occurs.
I am expecting an output like the one below (excluding the phrases that have no match):
Docs Phrase1 Phrase2 Phrase3 Phrase4 Phrase5
1 0 1 2 0 0
2 1 0 0 1 0
I tried your approach but could not replicate it.
Using:
library(tm)
library(qdap)

Docs <- c(Doc1, Doc2, Doc3, Doc4, Doc5)
text <- removeWords(Docs, stopwords("english"))
text <- removePunctuation(text)
text <- tolower(text)
corp <- Corpus(VectorSource(text))
Phrases <- tolower(Phrases)
word.freq <- apply_as_df(corp, termco_d, match.string = Phrases)
mcsv_w(word.freq, dir = NULL, open = TRUE, sep = ", ", dataframes = NULL,
       pos = 1, envir = as.environment(pos))
I get the following csv:
docs word.count term(just starting) term(several kilometers) term(brief stroll) term(gradually boost) term(5 miles) term(dark night) term(cold morning)
1 7 1 0 0 0 0 0 0
2 12 0 1 0 0 0 0 0
3 7 0 0 0 0 0 0 0
4 9 0 0 1 1 0 0 0
5 7 0 0 0 0 0 0 0
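For comparison, here is a minimal base-R sketch (no tm/qdap) of the desired output: raw occurrence counts of each phrase in each document, with never-matching phrases dropped. It skips the stopword/punctuation cleaning step and matches phrases against the lower-cased raw text, which is only a sketch of the counting logic, not a drop-in replacement for the pipeline above.

```r
# Sample objects from above, redefined so the snippet is self-contained.
Phrases <- c("just starting", "several kilometers", "brief stroll",
             "gradually boost", "5 miles", "dark night", "cold morning")
Docs <- c(
  "If you're just starting with workout, begin slow.",
  "Don't jump in brain initial and then try to operate several kilometers without the need of worked out well before.",
  "It is possible to end up injuring on your own and carrying out more damage than good.",
  "Instead start with a brief stroll and gradually boost the duration along with the speed.",
  "Before you know it you'll be working 5 miles without any problems."
)

docs_lower <- tolower(Docs)

# For each phrase, count its literal (fixed-string) occurrences per document.
# gregexpr() returns -1 when there is no match, hence the explicit check.
counts <- sapply(tolower(Phrases), function(p) {
  vapply(docs_lower, function(d) {
    hits <- gregexpr(p, d, fixed = TRUE)[[1]]
    if (hits[1] == -1L) 0L else length(hits)
  }, integer(1), USE.NAMES = FALSE)
})
rownames(counts) <- seq_along(Docs)

# Drop phrases that never match anywhere in the corpus
counts <- counts[, colSums(counts) > 0, drop = FALSE]
counts
```

At the stated scale (100k+ phrases, 60k+ documents) this nested loop will be slow; a vectorised counter such as stringi::stri_count_fixed(), or an Aho-Corasick-style multi-pattern matcher, would scale much better.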