Extract and count common word-pairs from a character vector
How do you find frequent adjacent word pairs in a character vector? For example, using the crude data set, some common pairs are "crude oil", "oil market", and "million barrels".
The code for the small example below tries to identify the frequent terms and then, using a positive lookahead assertion, count how many times those frequent terms are immediately followed by another frequent term. But the attempt fails.
Any guidance on how to create a data frame that shows the common pairs in the first column ("Pairs") and the number of times they appear in the text in the second column ("Count") would be appreciated.
library(qdap)
library(tm)
library(stringr)
# from the crude data set, create a text file from the first three documents, then clean it
text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, " ", "") # replace double spaces with single space
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
# pick the top 10 individual words by frequency, since they will likely form the most common pairs
freq.terms <- head(freq_terms(text.var = text), 10)
# create a pattern from the top words for the regex expression below
freq.terms.pat <- str_c(freq.terms$WORD, collapse = "|")
# match frequent terms that are followed by a frequent term
pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")
This is where the effort fails.
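The reason it fails: the pattern is passed as a literal string, so str_extract_all() searches for the characters "freq.terms.pat" rather than the alternation stored in that variable. A minimal repair is to build the pattern explicitly before passing it in (a sketch using the objects defined above; stringr's ICU regex engine supports the lookahead):
pat <- sprintf("(%s)(?= (%s))", freq.terms.pat, freq.terms.pat)  # interpolate the alternation into the regex
pairs <- str_extract_all(string = text, pattern = pat)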
I don't know Java or Python, so these didn't help me: Java count word pairs, Python count word pairs. But they may be useful to others.
Thank you.
First, modify your initial text vector from:
text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])
to:
text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
Then you can proceed with your text cleaning (note that your approach creates ill-formed words such as "oilcanadian", but it is good enough for the example at hand):
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, " ", "")
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
Build a new corpus:
v <- Corpus(VectorSource(text))
Create a bigram tokenizer function:
# ngrams() and words() come from the NLP package, which tm loads
BigramTokenizer <- function(x) {
unlist(
lapply(ngrams(words(x), 2), paste, collapse = " "),
use.names = FALSE
)
}
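To sanity-check the tokenizer, you can run the same ngrams()/paste() combination on a small token vector (output shown as a comment):
unlist(lapply(ngrams(c("crude", "oil", "prices"), 2), paste, collapse = " "), use.names = FALSE)
# [1] "crude oil"  "oil prices"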
Create your TermDocumentMatrix with the tokenize control argument:
tdm <- TermDocumentMatrix(v, control = list(tokenize = BigramTokenizer))
Now that you have the new tdm, you can get your desired output with:
library(dplyr)
data.frame(inspect(tdm)) %>%
  add_rownames() %>%                    # the bigrams become a regular column
  mutate(total = rowSums(.[,-1])) %>%   # total count across the three documents
  arrange(desc(total))
Which gives:
#Source: local data frame [272 x 5]
#
# rowname X1 X2 X3 total
#1 crude oil 2 0 1 3
#2 mln bpd 0 3 0 3
#3 oil prices 0 3 0 3
#4 cut contract 2 0 0 2
#5 demand opec 0 2 0 2
#6 dlrs barrel 2 0 0 2
#7 effective today 1 0 1 2
#8 emergency meeting 0 2 0 2
#9 oil companies 1 1 0 2
#10 oil industry 0 2 0 2
#.. ... .. .. .. ...
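If all you need is the two-column Pairs/Count data frame the question asks for, a base-R variant over the same tdm also works (a sketch; the object names here are arbitrary):
m <- as.matrix(tdm)                      # bigrams in rows, documents in columns
counts <- data.frame(Pairs = rownames(m), Count = rowSums(m), row.names = NULL)
counts[order(-counts$Count), ]           # most frequent pairs first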
Another idea is to create a new corpus of bigrams:
A bigram or digram is every sequence of two adjacent elements in a string of tokens
A recursive function to extract the bigrams:
bigram <- function(xs) {
  if (length(xs) >= 2)
    c(paste(xs[seq(2)], collapse = '_'),  # join the first two tokens with "_"
      bigram(tail(xs, -1)))               # recurse on the remaining tokens
}
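For example, on a short token vector:
bigram(c("crude", "oil", "prices"))
# [1] "crude_oil"  "oil_prices"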
Then apply it to the crude data from the tm package. (I do some text cleaning here, but this step depends on the text.)
res <- unlist(lapply(crude, function(x) {
  x <- tm::removeNumbers(tolower(x))
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub(' +', ' ', x)                 # collapse runs of spaces to a single space
  ## after cleaning, compute the bigram frequencies using table
  freqs <- table(bigram(strsplit(x, " ")[[1]]))
  freqs[freqs > 1]
}))
as.data.frame(tail(sort(res), 5))
#                           tail(sort(res), 5)
# reut-00022.xml.hold_a                      3
# reut-00022.xml.in_the                      3
# reut-00011.xml.of_the                      4
# reut-00022.xml.a_futures                   4
# reut-00010.xml.abdul_aziz                  5
The bigrams "abdul aziz" and "a futures" are the most common. You should clean the data again to remove stop words such as "of" and "the", but this should be a good start.
Edit, after the OP's comment:
If you want the bigram frequencies over the whole corpus, the idea is to compute the bigrams inside the loop and then compute the frequencies of the loop's result. I took the opportunity to add some better text cleaning.
res <- unlist(lapply(crude, function(x) {
  x <- removeNumbers(tolower(x))
  x <- removeWords(x, words = c("the", "of"))
  x <- removePunctuation(x)
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub(' +', ' ', x)                 # collapse runs of spaces to a single space
  ## after cleaning, collect the bigrams; frequencies are computed after the loop
  words <- strsplit(x, " ")[[1]]
  bigrams <- bigram(words[nchar(words) > 2])
}))
library(data.table)
xx <- as.data.frame(table(res))
setDT(xx)[order(Freq)]
# res Freq
# 1: abdulaziz_bin 1
# 2: ability_hold 1
# 3: ability_keep 1
# 4: ability_sell 1
# 5: able_hedge 1
# ---
# 2177: last_month 6
# 2178: crude_oil 7
# 2179: oil_minister 7
# 2180: world_oil 7
# 2181: oil_prices 14
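To match the column names the question asked for, rename the columns and sort descending (a final touch; setnames() is the standard data.table idiom):
setnames(xx, c("res", "Freq"), c("Pairs", "Count"))
xx[order(-Count)]                        # most frequent pairs first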