Remove words that occur only once and with low IDF in R
I have a data frame with a column containing some text. I want to perform three data preprocessing steps:
1) remove words that occur only once,
2) remove words with a low inverse document frequency (IDF), and
3) remove the most frequently occurring words.
Here is a sample of the data:
head(stormfront_data$stormfront_self_content)
Output:
[1] " , , stormfront! thread members post introduction, \".\" stumbled white networking site, reading & decided register account, largest networking site white brothers, sisters! read : : guidelines posting - stormfront introduction stormfront - stormfront main board consists forums, -forums : newslinks & articles - stormfront ideology philosophy - stormfront activism - stormfront network local level: local regional - stormfront international - stormfront , . addition main board supply social groups utilized networking. final note: steps sustaining member, core member site online, affords additional online features. sf: shopping cart stormfront!"
[2] "bonjour warm brother ! forward speaking !"
[3] " check time time forums. frequently moved columbia distinctly numbered. groups gatherings "
[4] " ! site pretty nice. amount news articles. main concern moment islamification."
[5] " , discovered site weeks ago. finally decided join found article wanted share . proud race long time idea site people shared views existed."
[6] " white brothers, names jay member years, bit info ? stormfront meet ups ? stay strong guys jay, uk"
Any help would be greatly appreciated, as I am not very familiar with R.
Here is a tidytext approach:
library(tidytext)
library(dplyr)
# Per-document word counts (data is a 6 x 1 character matrix, so take its column):
word_count <- tibble(document = seq_len(nrow(data)), text = data[, 1]) %>%
  unnest_tokens(word, text) %>%
  count(document, word, sort = TRUE)

# Corpus-wide totals for each word:
total_count <- tibble(document = seq_len(nrow(data)), text = data[, 1]) %>%
  unnest_tokens(word, text) %>%
  group_by(word) %>%
  summarize(total = n())

words <- left_join(word_count, total_count, by = "word")

words %>%
  bind_tf_idf(word, document, n)
# A tibble: 111 x 7
document word n total tf idf tf_idf
<int> <chr> <int> <int> <dbl> <dbl> <dbl>
1 1 stormfront 10 11 0.139 1.10 0.153
2 1 networking 3 3 0.0417 1.79 0.0747
3 1 site 3 6 0.0417 0.693 0.0289
4 1 board 2 2 0.0278 1.79 0.0498
5 1 forums 2 3 0.0278 1.10 0.0305
6 1 introduction 2 2 0.0278 1.79 0.0498
7 1 local 2 2 0.0278 1.79 0.0498
8 1 main 2 3 0.0278 1.10 0.0305
9 1 member 2 3 0.0278 1.10 0.0305
10 1 online 2 2 0.0278 1.79 0.0498
# … with 101 more rows
From here it is straightforward to filter with dplyr::filter, but since you did not define any specific criteria other than "only once", I will leave that to you.
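As a minimal sketch of what that filtering could look like (the cutoffs below are my own illustrative assumptions, not part of the question, and would need tuning):
tf_idf_tbl <- words %>%
  bind_tf_idf(word, document, n)
tf_idf_tbl %>%
  filter(total > 1,                 # 1) drop words that occur only once
         idf > quantile(idf, 0.25), # 2) drop low-IDF words (bottom quartile, an assumed cutoff)
         total < max(total))        # 3) drop the most frequent word(s)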
Data
data <- structure(c(" , , stormfront! thread members post introduction, \".\" stumbled white networking site, reading & decided register account, largest networking site white brothers, sisters! read : : guidelines posting - stormfront introduction stormfront - stormfront main board consists forums, -forums : newslinks & articles - stormfront ideology philosophy - stormfront activism - stormfront network local level: local regional - stormfront international - stormfront , . addition main board supply social groups utilized networking. final note: steps sustaining member, core member site online, affords additional online features. sf: shopping cart stormfront!",
"bonjour warm brother ! forward speaking !", " check time time forums. frequently moved columbia distinctly numbered. groups gatherings ",
" ! site pretty nice. amount news articles. main concern moment islamification.",
" , discovered site weeks ago. finally decided join found article wanted share . proud race long time idea site people shared views existed.",
" white brothers, names jay member years, bit info ? stormfront meet ups ? stay strong guys jay, uk"
), .Dim = c(6L, 1L))
Here is a solution to Q1 in a few steps:
Step 1: Clean up the data by removing anything that is not a word character (\W):
data2 <- trimws(paste0(gsub("\\W+", " ", data), collapse = ""))
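Note that paste0(..., collapse = "") also joins all six records into one string, so the corpus is treated as a single document from here on. A quick illustration of the cleaning on a toy string (not taken from the data):
gsub("\\W+", " ", " , , stormfront! thread members")
# [1] " stormfront thread members"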
Step 2: Build a sorted frequency list of the words:
fw <- as.data.frame(sort(table(unlist(strsplit(data2, "\\s+"))), decreasing = TRUE))
Step 3: Define the pattern to match (i.e., all the words occurring just once), making sure to wrap it in boundary position markers (\b) so that only exact matches get caught (e.g., network but not networking):
pattern <- paste0("\\b(", paste0(fw$Var1[fw$Freq == 1], collapse = "|"), ")\\b")
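A quick illustration of why the boundaries matter (a toy example):
gsub("\\bnetwork\\b", "", "network networking")
# [1] " networking"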
Step 4: Remove the matched words:
data3 <- gsub(pattern, "", data2)
Step 5: Clean up by removing extra whitespace:
data4 <- trimws(gsub("\\s+", " ", data3))
Result:
[1] "stormfront introduction white networking site decided networking site white brothers stormfront introduction stormfront stormfront main board forums forums articles stormfront stormfront stormfront local local stormfront stormfront main board groups networking member member site online online stormfront time time forums groups site articles main site decided time site white brothers jay member stormfront jay"
A base R solution:
# df is assumed to hold the example data, e.g. df <- data
# Remove double spacing and punctuation at the start of strings:
# cleaned_str => character vector
cstr <- trimws(gsub("\\s*[[:punct:]]+", "", trimws(gsub('\\s+|^\\s*[[:punct:]]+|"',
                    ' ', df), "both")), "both")
# Calculate the document frequency: document_freq => data.frame
document_freq <- data.frame(table(unlist(sapply(cstr, function(x){
  unique(unlist(strsplit(x, "[^a-z]+")))}))))
# Store the inverse document frequency as a vector: idf => double vector
document_freq$idf <- log(length(cstr)/document_freq$Freq)
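As a sanity check on this definition, idf = log(N / document frequency): a word such as site appears in 3 of the 6 records, so its idf is log(6/3) ≈ 0.693, which matches the value in the tf-idf table above.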
# For each record, remove terms that occur only once, terms that occur the
# maximum number of times within that record, and terms with a "low" idf
# (here: below the lower Tukey fence, Q1 - 1.5 * IQR):
# pp_records => data.frame
pp_records <- do.call("rbind", lapply(cstr, function(x){
  # Store the term and corresponding term frequency as a data.frame: tf_dataf => data.frame
  tf_dataf <- data.frame(table(as.character(na.omit(gsub("^$", NA_character_,
                                unlist(strsplit(x, "[^a-z]+")))))),
                         stringsAsFactors = FALSE)
  # Look up each term's idf: idf => double vector
  tf_dataf$idf <- document_freq$idf[match(tf_dataf$Var1, document_freq$Var1)]
  # Explicitly return the preprocessed record: => data.frame
  return(
    data.frame(
      cleaned_record = x,
      pp_records =
        paste0(unique(unlist(
          strsplit(gsub("\\s+", " ",
                        trimws(
                          # Wrap the terms in \b so that only whole words are removed:
                          gsub(paste0("\\b(",
                                      paste0(tf_dataf$Var1[tf_dataf$Freq == 1 |
                                        tf_dataf$idf < (quantile(tf_dataf$idf, .25) - (1.5 * IQR(tf_dataf$idf))) |
                                        tf_dataf$Freq == max(tf_dataf$Freq)],
                                        collapse = "|"), ")\\b"), "", x), "both"
                        )), "\\s")
        )), collapse = " "),
      row.names = NULL,
      stringsAsFactors = FALSE
    )
  )
}))
# Column-bind the cleaned strings to the original records: ppd_cleaned_df => data.frame
ppd_cleaned_df <- cbind(orig_record = df, pp_records)
# Output to console: ppd_cleaned_df => stdout (console)
ppd_cleaned_df
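The code above never defines df; to run it against the example data, something like this is assumed:
df <- data[, 1]
# ...then run the code above and inspect the result:
head(ppd_cleaned_df)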