R tidytext: Remove word if part of relevant bigrams, but keep if not

Question

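Using unnest_tokens, I want to create a tidy text tibble that combines two different kinds of tokens: single words and bigrams. The reasoning is that sometimes single words are the more sensible unit of study, and sometimes higher-order n-grams are.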

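If two words appear together as a "sensible" bigram, I want to store the bigram rather than the individual words. If the same words appear in a different context (i.e. not as that bigram), I want to keep them as single words.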

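In the silly example below, "of the" is an important bigram. So wherever "of" and "the" actually appear in the text as "of the", I want to remove them as single words. But if "of" and "the" appear in other combinations, I want to keep them as single words.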

library(janeaustenr)
library(data.table)
library(dplyr)
library(tidytext)
library(tidyr)


# make unigrams
tide <- unnest_tokens(austen_books() , output = word, input = text )
# make bigrams
tide2 <- unnest_tokens(austen_books(), output = bigrams, input = text, token = "ngrams", n = 2)

# keep only most frequent bigrams (in reality use more sensible metric)
keepbigram <- names(sort(table(tide2$bigrams), decreasing = TRUE)[1:10])
keepbigram
tide2 <- tide2[tide2$bigrams %in% keepbigram,]

# this removes all unigrams which show up in relevant bigrams
biwords <- unlist( strsplit( keepbigram, " ") )
biwords
tide[!(tide$word %in% biwords),]

# want to keep biwords in tide if they are not part of bigrams

Answer 1

You can do this by replacing the bigrams you are interested in with a compound in the text before tokenization (i.e. before unnest_tokens):

keepbigram_new <- stringi::stri_replace_all_regex(keepbigram, "\\s+", "_")
keepbigram_new
#>  [1] "of_the"   "to_be"    "in_the"   "it_was"   "i_am"     "she_had" 
#>  [7] "of_her"   "to_the"   "she_was"  "had_been"

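Using _ instead of whitespace is common practice for this. stringi::stri_replace_all_regex works much like gsub from base R or str_replace_all from stringr, but is a little faster and has more features.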

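Now replace the bigrams in the text with these new compounds before tokenization. I use word-boundary regular expressions (\b) at the beginning and end of each bigram so that, for example, "of them" is not captured by accident: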

topwords <- austen_books() %>% 
  mutate(text = stringi::stri_replace_all_regex(text, paste0("\\b", keepbigram, "\\b"), keepbigram_new, vectorize_all = FALSE)) %>% 
  unnest_tokens(output = word, input = text) %>% 
  count(word, sort = TRUE) %>% 
  mutate(rank = seq_along(word))
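
To see what the word boundaries buy you, here is a minimal sketch on a made-up sentence. The names toy_bigrams, toy_compounds and toy_text are hypothetical and only serve as illustration; they are not part of the Austen data.

```r
library(stringi)

# Hypothetical toy inputs (assumptions for illustration only)
toy_bigrams   <- c("of the", "to be")
toy_compounds <- stri_replace_all_regex(toy_bigrams, "\\s+", "_")
toy_text      <- "one of them spoke of the plan to be made"

# With \\b on both sides, "of them" is left alone
with_boundaries <- stri_replace_all_regex(
  toy_text, paste0("\\b", toy_bigrams, "\\b"), toy_compounds,
  vectorize_all = FALSE
)
with_boundaries
#> [1] "one of them spoke of_the plan to_be made"

# Without boundaries, "of the" also matches the start of "of them"
without_boundaries <- stri_replace_all_regex(
  toy_text, toy_bigrams, toy_compounds,
  vectorize_all = FALSE
)
without_boundaries
#> [1] "one of_them spoke of_the plan to_be made"
```

vectorize_all = FALSE is what makes every pattern get applied in turn to the same string, rather than pairing the first pattern with the first string, the second with the second, and so on.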

Looking at the most frequent words, the first bigram now appears at rank 40:

topwords %>% 
  slice(1:4, 39:41)
#> # A tibble: 7 x 3
#>   word       n  rank
#>   <chr>  <int> <int>
#> 1 and    22515     1
#> 2 to     20152     2
#> 3 the    20072     3
#> 4 of     16984     4
#> 5 they    2983    39
#> 6 of_the  2833    40
#> 7 from    2795    41
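
One caveat, as far as I can tell: the bigrams in keepbigram are lowercase because unnest_tokens lowercases its output, while the replacement runs on the raw, mixed-case text, so a sentence-initial "Of the" would not be matched. If that matters for your data, a possible variation is case-insensitive matching via opts_regex. This is a sketch on a made-up sentence (toy_text is hypothetical), not something the counts above reflect:

```r
library(stringi)

# Hypothetical toy sentence (assumption for illustration only)
toy_text <- "Of the two plans, one of the two was dropped."

# Default matching is case-sensitive: the leading "Of the" is missed
case_sensitive <- stri_replace_all_regex(toy_text, "\\bof the\\b", "of_the")
case_sensitive
#> [1] "Of the two plans, one of_the two was dropped."

# Case-insensitive matching catches both occurrences
case_insensitive <- stri_replace_all_regex(
  toy_text, "\\bof the\\b", "of_the",
  opts_regex = stri_opts_regex(case_insensitive = TRUE)
)
case_insensitive
#> [1] "of_the two plans, one of_the two was dropped."
```

Since unnest_tokens lowercases tokens by default anyway, writing the lowercase compound back into the text should not lose any information that the tokenizer would have kept.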
