Create a frequency table using R and a term-document matrix
I created the following data frame, which consists of a few email subject lines.
df <- data.frame(subject=c('Free ! Free! Free ! Clear Cover with New Phone',
'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'))
I also derived a list of frequent words from this data frame. I added these keywords to the data frame as columns and dummy-coded them as 0:
most_freq_words <- c('Free', 'New', 'Limited', 'Offer')
Subject                                                           Free  New  Limited  Offer
'Free Free Free! Clear Cover with New Phone'                         0    0        0      0
'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'     0    0        0      0
I want to get the frequency of each word in each subject line. The output should look like this:
Subject                                                           Free  New  Limited  Offer
'Free Free Free! Clear Cover with New Phone'                         3    1        0      0
'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'     0    1        1      2
I tried the code below:
for (i in 1:length(most_freq_words)){
df[[most_freq_words[i]]] <- as.numeric(grepl(tolower(most_freq_words[i]),
tolower(df$subject)))}
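However, this only tells me whether the word is present in the sentence. I need the output shown above. Could someone help me with this?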
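Replace grepl with gregexpr, then check the length of the first list item. Because gregexpr returns -1 when there is no match, the code tests the first match position before taking the length. The for loop also needs to run over every row of df. Keeping the intent of the OP's for loop, the modified code looks like this: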
for (i in 1:length(most_freq_words)){
for(j in 1:nrow(df)){
df[j,most_freq_words[i]] <- ifelse(gregexpr(tolower(most_freq_words[i]),
tolower(df$subject[j]))[[1]][[1]] >0,
length(gregexpr(tolower(most_freq_words[i]), tolower(df$subject[j]))[[1]]), 0)
}
}
> df
subject Free New Limited Offer
1 Free ! Free! Free ! Clear Cover with New Phone 3 1 0 0
2 Offer ! Buy New phone and get earphone at 1000. Limited Offer! 0 1 1 2
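The same gregexpr idea can also be written without the nested loops. The following is only a sketch in base R: count_matches is a hypothetical helper name (not from the answer above), and it relies on regmatches returning an empty vector for a non-matching string, so lengths gives 0 without a separate check for -1.

```r
# Same example data as in the question.
df <- data.frame(
  subject = c("Free ! Free! Free ! Clear Cover with New Phone",
              "Offer ! Buy New phone and get earphone at 1000. Limited Offer!"),
  stringsAsFactors = FALSE
)
most_freq_words <- c("Free", "New", "Limited", "Offer")

# Hypothetical helper: number of case-insensitive matches of `word`
# in each element of `text`.
count_matches <- function(word, text) {
  hits <- gregexpr(word, text, ignore.case = TRUE)
  # regmatches() returns character(0) where nothing matched,
  # so lengths() yields 0 for those elements.
  lengths(regmatches(text, hits))
}

# One column of counts per word; sapply() names the columns after the words.
counts <- sapply(most_freq_words, count_matches, text = df$subject)
result <- cbind(df, counts)
print(result)
```

Note that sapply returns a plain vector rather than a matrix when df has a single row, so a one-row data frame would need extra handling.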
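I handled this task with the tidytext package. First, I added a grouping variable to the data set. Then I used unnest_tokens() to split the subjects into words and removed every word that is not in most_freq_words. Next, I counted how many times each word appears in each sentence. Finally, I converted the long-format data to wide format. If you still want the original sentences, you can easily add them to the output (for example, by adding cbind(subject = df$subject) after the spread() line).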
library(dplyr)
library(tidyr)     # provides spread()
library(tidytext)
df <- data.frame(subject=c('Free ! Free! Free ! Clear Cover with New Phone',
'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'),
stringsAsFactors = FALSE)
most_freq_words <- c('Free', 'New', 'Limited', 'Offer')
mutate(df, group = 1:n()) %>%
unnest_tokens(input = subject, output = word, token = "words", to_lower = FALSE) %>%
filter(word %in% most_freq_words) %>%
count(group, word) %>%
spread(key = word, value = n, fill = 0)
group Free Limited New Offer
<int> <dbl> <dbl> <dbl> <dbl>
1 1 3.00 0 1.00 0
2 2 0 1.00 1.00 2.00
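As a variation on the pipeline above, the counts can be joined back to the original subjects by the group key. This is a sketch, not the answerer's code: it assumes tidyr 1.0 or later and swaps in pivot_wider() for spread(), and the object names counts and result are chosen here for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

df <- data.frame(
  subject = c("Free ! Free! Free ! Clear Cover with New Phone",
              "Offer ! Buy New phone and get earphone at 1000. Limited Offer!"),
  stringsAsFactors = FALSE
)
most_freq_words <- c("Free", "New", "Limited", "Offer")

# Keep a row id so the counts can be matched back to their subject line.
df_id <- mutate(df, group = row_number())

counts <- df_id %>%
  unnest_tokens(output = word, input = subject,
                token = "words", to_lower = FALSE) %>%
  filter(word %in% most_freq_words) %>%
  count(group, word) %>%
  # pivot_wider() is used here in place of spread(); missing
  # word/subject combinations are filled with 0.
  pivot_wider(names_from = word, values_from = n,
              values_fill = list(n = 0L))

# Join the wide counts back onto the original subject lines.
result <- left_join(df_id, counts, by = "group")
print(result)
```

A subject that contains none of the words drops out of counts, so its columns come back as NA after the join and would need to be replaced with 0 separately.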
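Here is another option with the tidyverse. We loop over most_freq_words with map, use str_count to count each word's occurrences in the subject column of df, convert the result to a tibble, name the column after the word with set_names, and bind the columns to the original data frame df.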
library(tidyverse)
most_freq_words %>%
map(~ str_count(df$subject, .x) %>%
as_tibble %>%
set_names(.x)) %>%
bind_cols(df, .)
# subject Free New Limited Offer
#1 Free ! Free! Free ! Clear Cover with New Phone 3 1 0 0
#2 Offer ! Buy New phone and get earphone at 1000. Limited Offer! 0 1 1 2
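One caveat applies to both gregexpr and str_count: they match substrings, so a pattern such as Free would also be counted inside a longer word. If whole-word counts are wanted, the pattern can be wrapped in word boundaries. The sketch below uses base R; count_whole_word is a hypothetical helper, the second subject line is made up for illustration, and the words are assumed to contain no regex metacharacters.

```r
# Hypothetical helper: count whole-word, case-insensitive matches of
# `word` in each element of `text`.
count_whole_word <- function(text, word) {
  pattern <- paste0("\\b", word, "\\b")   # \b marks a word boundary
  hits <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  lengths(regmatches(text, hits))
}

# The first subject is from the question; the second is invented to show
# the difference between substring and whole-word matching.
subjects <- c("Free ! Free! Free ! Clear Cover with New Phone",
              "Freedom sale: renewed offers")

whole_free <- count_whole_word(subjects, "Free")
whole_new  <- count_whole_word(subjects, "New")

# Plain substring counts for comparison.
sub_free <- lengths(regmatches(subjects,
                               gregexpr("Free", subjects, ignore.case = TRUE)))

print(rbind(whole_free, whole_new, sub_free))
```

Here "Freedom" and "renewed" contribute to the substring counts but not to the whole-word counts.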