Remove words that occur only once and with low IDF in R

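I have a data frame with a column that contains some text. I want to do three data preprocessing steps:

1) remove words that occur only once, 2) remove words with a low inverse document frequency (IDF), and 3) remove the most frequently occurring words.

Here is a sample of the data: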

head(stormfront_data$stormfront_self_content)

Output:

[1] "        , ,    stormfront!  thread       members  post  introduction,     \".\"     stumbled   white networking site,    reading & decided  register  account,      largest networking site     white brothers,  sisters!    read : : guidelines  posting - stormfront introduction  stormfront - stormfront  main board consists   forums,  -forums   : newslinks & articles - stormfront ideology  philosophy - stormfront activism - stormfront       network   local level: local  regional - stormfront international - stormfront  ,  .  addition   main board   supply  social groups    utilized  networking.  final note:      steps    sustaining member,  core member      site online,   affords  additional online features. sf: shopping cart   stormfront!"
[2] "bonjour      warm  brother !   forward  speaking     !"                                                                                                                      
[3] " check   time  time   forums.      frequently    moved  columbia   distinctly  numbered.    groups  gatherings         "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[4] "  !  site  pretty nice.    amount  news articles.  main concern   moment  islamification."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
[5] " , discovered  site   weeks ago.  finally decided  join   found  article  wanted  share  .   proud   race   long time    idea  site    people  shared  views existed."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
[6] "  white brothers,  names jay      member   years,        bit  info    ?    stormfront meet ups     ? stay strong guys    jay, uk"                                                                                                           

Any help would be greatly appreciated, as I am not very familiar with R.

Here is a tidytext approach:

library(tidytext)
library(dplyr)
word_count <- tibble(document = seq(1,nrow(data)), text = data) %>%
  unnest_tokens(word, text) %>%
  count(document, word, sort = TRUE)

total_count <- tibble(document = seq(1,nrow(data)), text = data) %>%
  unnest_tokens(word, text) %>%
  group_by(word) %>% 
  summarize(total = n()) 

# Join on the shared "word" column explicitly to avoid the join message
words <- left_join(word_count, total_count, by = "word")

words %>%
  bind_tf_idf(word, document, n)
# A tibble: 111 x 7
   document word             n total     tf   idf tf_idf
      <int> <chr>        <int> <int>  <dbl> <dbl>  <dbl>
 1        1 stormfront      10    11 0.139  1.10  0.153 
 2        1 networking       3     3 0.0417 1.79  0.0747
 3        1 site             3     6 0.0417 0.693 0.0289
 4        1 board            2     2 0.0278 1.79  0.0498
 5        1 forums           2     3 0.0278 1.10  0.0305
 6        1 introduction     2     2 0.0278 1.79  0.0498
 7        1 local            2     2 0.0278 1.79  0.0498
 8        1 main             2     3 0.0278 1.10  0.0305
 9        1 member           2     3 0.0278 1.10  0.0305
10        1 online           2     2 0.0278 1.79  0.0498
# … with 101 more rows
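
As a sanity check on the output above, the idf column is consistent with the natural logarithm of (number of documents / number of documents containing the word), and tf with a word's count divided by the number of tokens in its document. The sketch below reproduces a few of the printed values by hand. The figure of 72 tokens in document 1 is an assumption inferred from the printed tf values, not something stated in the answer.

```r
n_docs <- 6  # the sample data has six documents

idf_in_one_doc    <- log(n_docs / 1)  # e.g. "networking": about 1.79
idf_in_two_docs   <- log(n_docs / 2)  # e.g. "stormfront": about 1.10
idf_in_three_docs <- log(n_docs / 3)  # e.g. "site": about 0.693

# Assumed token count of document 1 (inferred from tf = 10 / 72 = 0.139)
doc1_tokens <- 72
tf_stormfront <- 10 / doc1_tokens
tf_idf_stormfront <- tf_stormfront * idf_in_two_docs

round(c(tf = tf_stormfront, idf = idf_in_two_docs, tf_idf = tf_idf_stormfront), 3)
```

A word that appears in every document gets an idf of log(1) = 0, which is why a low idf marks words that carry little document-specific information.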

From here it is straightforward to filter with dplyr::filter, but since you have not defined any specific criteria other than "only once", I will leave that to you.
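
One possible way to do that filtering is sketched below. It runs on a small invented table with the same columns as the bind_tf_idf() output, so that it is self-contained. The names words_tf_idf, min_idf and top_n_words, as well as the cutoff values, are hypothetical and would need to be chosen for the real data.

```r
library(dplyr)

# Toy stand-in for the result of bind_tf_idf(); the numbers are invented.
words_tf_idf <- tibble(
  document = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  word     = c("stormfront", "site", "thread", "site", "forums",
               "stormfront", "forums"),
  total    = c(11L, 6L, 1L, 6L, 3L, 11L, 3L),
  idf      = c(1.10, 0.69, 1.79, 0.69, 1.10, 1.10, 1.10)
)

min_idf     <- 1.0  # hypothetical cutoff for "low IDF"
top_n_words <- 1    # hypothetical number of most frequent words to drop

# Words with the highest overall counts
most_frequent <- words_tf_idf %>%
  distinct(word, total) %>%
  arrange(desc(total)) %>%
  head(top_n_words) %>%
  pull(word)

kept <- words_tf_idf %>%
  filter(total > 1,                  # 1) drop words that occur only once
         idf >= min_idf,             # 2) drop words with a low IDF
         !word %in% most_frequent)   # 3) drop the most frequent words

kept
```

On the real data, the same three conditions could be applied to the result of bind_tf_idf(word, document, n) instead of the toy table.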

Data

data <- structure(c("        , ,    stormfront!  thread       members  post  introduction,     \".\"     stumbled   white networking site,    reading & decided  register  account,      largest networking site     white brothers,  sisters!    read : : guidelines  posting - stormfront introduction  stormfront - stormfront  main board consists   forums,  -forums   : newslinks & articles - stormfront ideology  philosophy - stormfront activism - stormfront       network   local level: local  regional - stormfront international - stormfront  ,  .  addition   main board   supply  social groups    utilized  networking.  final note:      steps    sustaining member,  core member      site online,   affords  additional online features. sf: shopping cart   stormfront!", 
"bonjour      warm  brother !   forward  speaking     !", " check   time  time   forums.      frequently    moved  columbia   distinctly  numbered.    groups  gatherings         ", 
"  !  site  pretty nice.    amount  news articles.  main concern   moment  islamification.", 
" , discovered  site   weeks ago.  finally decided  join   found  article  wanted  share  .   proud   race   long time    idea  site    people  shared  views existed.", 
"  white brothers,  names jay      member   years,        bit  info    ?    stormfront meet ups     ? stay strong guys    jay, uk"
), .Dim = c(6L, 1L))

Here is a solution to Q1 in several steps:

Step 1: Clean the data by removing anything that is not alphanumeric (\W):

data2 <- trimws(paste0(gsub("\\W+", " ", data), collapse = ""))

Step 2: Make a sorted frequency list of the words:

fw <- as.data.frame(sort(table(strsplit(data2, "\\s+")), decreasing = TRUE))

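Step 3: Define the pattern to match (i.e., all words that occur only once), making sure to wrap them in word-boundary anchors (\b) so that only exact matches are matched (e.g., network but not networking):

pattern <- paste0("\\b(", paste0(fw$Var1[fw$Freq == 1], collapse = "|"), ")\\b")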

Step 4: Remove the matched words:

data3 <- gsub(pattern, "", data2)

Step 5: Clean up by removing superfluous whitespace:

data4 <- trimws(gsub("\\s+", " ", data3))

Result:

[1] "stormfront introduction white networking site decided networking site white brothers stormfront introduction stormfront stormfront main board forums forums articles stormfront stormfront stormfront local local stormfront stormfront main board groups networking member member site online online stormfront time time forums groups site articles main site decided time site white brothers jay member stormfront jay"
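
To see why the word-boundary anchors in Step 3 matter, the small sketch below removes the word network from an invented string with and without them. The variables txt and once are hypothetical. Note that inside an R string literal the backslash has to be doubled, so the regex \b is written "\\b".

```r
# Hypothetical toy input: "network" should go, "networking" should stay.
txt  <- "network networking site network"
once <- c("network")

alternatives <- paste0(once, collapse = "|")

# Without anchors the match also eats the start of "networking".
without_boundary <- gsub(paste0("(", alternatives, ")"), "", txt)

# With \\b on both sides only the whole word "network" is removed.
with_boundary <- gsub(paste0("\\b(", alternatives, ")\\b"), "", txt)

trimws(without_boundary)
trimws(with_boundary)
```

The first result leaves the fragment "ing" behind, while the second keeps networking intact.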

Base R solution:

# Remove double spacing and punctuation at the start of strings: 
# cleaned_str => character vector
# Note: df is not defined in this answer; it is assumed to hold the text column.
cstr <- trimws(gsub("\\s*[[:punct:]]+", "", trimws(gsub('\\s+|^\\s*[[:punct:]]+|"',
                    ' ', df), "both")), "both")

# Calculate the document frequency: document_freq => data.frame
document_freq <- data.frame(table(unlist(sapply(cstr, function(x){
  unique(unlist(strsplit(x, "[^a-z]+")))}))))

# Store the inverse document frequency as a vector: idf => double vector: 
document_freq$idf <- log(length(cstr)/document_freq$Freq)

# For each record remove terms that occur only once, occur the maximum number 
# of times a word occurs in the dataset, or words with a "low" idf: 
# pp_records => data.frame
pp_records <- do.call("rbind", lapply(cstr, function(x){
    # Store the term and corresponding term frequency as a data.frame: tf_dataf => data.frame
    tf_dataf <- data.frame(table(as.character(na.omit(gsub("^$", NA_character_, 
                                                           unlist(strsplit(x, "[^a-z]+")))))),
                           stringsAsFactors = FALSE)

    # Store a vector containing each term's idf: idf => double vector
    tf_dataf$idf <- document_freq$idf[match(tf_dataf$Var1, document_freq$Var1)]

    # Explicitly return the cleaned record and its preprocessed version: => data.frame
    return(
      data.frame(
        cleaned_record = x,
        pp_records =
          paste0(unique(unlist(
            strsplit(gsub("\\s+", " ",
                          trimws(
                            gsub(paste0(tf_dataf$Var1[tf_dataf$Freq == 1 |
                                                        tf_dataf$idf < (quantile(tf_dataf$idf, .25) - (1.5 * IQR(tf_dataf$idf))) |
                                                        tf_dataf$Freq == max(tf_dataf$Freq)],
                                        collapse = "|"), "", x), "both"
                          )), "\\s")
          )), collapse = " "),
        row.names = NULL,
        stringsAsFactors = FALSE
      )
    )
  }
))

# Column bind cleaned strings with the original records: ppd_cleaned_df => data.frame 
ppd_cleaned_df <- cbind(orig_record = df, pp_records)

# Output to console: ppd_cleaned_df => stdout (console)
ppd_cleaned_df
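
The two ideas this solution rests on, the IDF computed as log(N / document frequency) and a "low IDF" threshold set at the first quartile minus 1.5 times the interquartile range, can be seen in isolation in the minimal sketch below. The corpus docs is invented, and the variable names are hypothetical rather than taken from the code above.

```r
# Hypothetical toy corpus of three already-cleaned documents.
docs <- c("white site forums", "site news", "site forums main")

# Unique terms per document, so each document counts a term at most once.
terms_per_doc <- lapply(strsplit(docs, "[^a-z]+"), unique)

# Document frequency: number of documents that contain each term.
tab <- table(unlist(terms_per_doc))
doc_freq <- setNames(as.vector(tab), names(tab))

# Inverse document frequency using the natural log: log(N / df).
idf <- log(length(docs) / doc_freq)

# Lower fence used above as the "low IDF" cutoff: Q1 - 1.5 * IQR.
low_fence <- unname(quantile(idf, 0.25)) - 1.5 * IQR(idf)
low_idf_terms <- names(idf)[idf < low_fence]

idf
low_fence
low_idf_terms
```

On this toy corpus the fence comes out negative, and since an IDF computed this way is never below zero, no term is flagged as low. If that also happens on the real data, a fixed cutoff or a quantile of the IDF values may be a more useful definition of "low".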