Hash vectorizer in R text2vec package with stopwords removal option

Question

I am using the R text2vec package to create a document-term matrix. Here is my code:

library(lime)
library(text2vec)
library(magrittr)  # provides %>%

# load data
data(train_sentences, package = "lime")

# tokenize each sentence into words
tokens <- train_sentences$text %>%
  word_tokenizer

it <- itoken(tokens, progressbar = FALSE)

stop_words <- c("in", "the", "a", "at", "for", "is", "am") # stopwords
vocab <- create_vocabulary(it, ngram = c(1L, 2L), stopwords = stop_words) %>%
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer <- vocab_vectorizer(vocab)

dtm <- create_dtm(it, vectorizer, type = "dgTMatrix")

An alternative is to use hash_vectorizer() instead of vocab_vectorizer():

h_vectorizer <- hash_vectorizer(hash_size = 2^10, ngram = c(1L, 2L))
dtm <- create_dtm(it, h_vectorizer)

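But when I use hash_vectorizer, there is no option for removing stop words or pruning the vocabulary. In one case study, hash_vectorizer works better for me than vocab_vectorizer. I know stop words can be removed after creating the dtm, or even while creating the tokens. Are there any other options, similar to vocab_vectorizer and the way it is created? In particular, I am interested in a method that also supports pruning the vocabulary, similar to prune_vocabulary().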

I appreciate your responses. Thanks, Sam

Answer 1

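This is not possible. The whole point of using hash_vectorizer and feature hashing is to avoid hash map lookups (getting the index for a given word). Removing stop words is essentially the same kind of operation: checking whether a word is in the set of stop words. It is generally recommended to use hash_vectorizer only if your dataset is very large and building a vocabulary takes a lot of time/memory. Otherwise, in my experience, vocab_vectorizer with prune_vocabulary will do at least no worse.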
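To make the contrast concrete, here is a minimal base R sketch of the idea behind feature hashing. Note that toy_hash() is a hypothetical stand-in written for illustration; it is not the hash function text2vec uses internally, and the tokens and hash size are made up.

```r
# Toy feature hashing: map each token straight to a column index.
# toy_hash() is a hypothetical helper, NOT text2vec's internal hash.
toy_hash <- function(token, hash_size) {
  h <- 0
  # simple polynomial rolling hash, kept small with modular arithmetic
  for (code in utf8ToInt(token)) h <- (h * 31 + code) %% hash_size
  h + 1  # R indices start at 1
}

tokens <- c("the", "cat", "sat", "on", "the", "mat")
hash_size <- 16

# No vocabulary is built: the column index is computed from the token itself
cols <- vapply(tokens, toy_hash, numeric(1), hash_size = hash_size)
counts <- tabulate(cols, nbins = hash_size)

# Removing stop words, by contrast, needs a set-membership lookup per token
stop_words <- c("the", "on")
kept <- tokens[!tokens %in% stop_words]
```

The hashing step never asks "have I seen this word before?", whereas the stop word filter has to ask exactly that kind of question for every token.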

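Also, if you use hash_vectorizer with a small hash_size, it acts as a dimensionality reduction step and can therefore reduce variance on your dataset. So if your dataset is not very big, I would suggest using vocab_vectorizer and playing with the prune_vocabulary parameters to reduce the size of the vocabulary and of the document-term matrix.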
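If you still want to stay with hash_vectorizer, the workarounds mentioned in the question (filtering tokens before hashing, and dropping columns after the dtm is built) can be sketched roughly as below. This is only a sketch under assumptions: the docs vector and the thresholds are made up for illustration, and the pruning acts on hash buckets rather than on terms, so colliding words are kept or dropped together. It is not equivalent to prune_vocabulary().

```r
library(text2vec)
library(Matrix)

# Hypothetical toy corpus, for illustration only
docs <- c("the cat sat in the garden",
          "a dog is in the garden",
          "the cat is at the door",
          "a bird sat at the window")
stop_words <- c("in", "the", "a", "at", "for", "is", "am")

# 1. Drop stop words at the token level, before anything is hashed
tokens <- word_tokenizer(tolower(docs))
tokens <- lapply(tokens, function(x) x[!x %in% stop_words])

it <- itoken(tokens, progressbar = FALSE)
h_vectorizer <- hash_vectorizer(hash_size = 2^10, ngram = c(1L, 2L))
dtm <- create_dtm(it, h_vectorizer)

# 2. Approximate pruning on the hashed columns (buckets, not terms)
doc_count <- Matrix::colSums(dtm != 0)      # documents hitting each bucket
keep <- doc_count >= 2 &                    # assumed minimum document count
  doc_count / nrow(dtm) <= 0.75             # assumed maximum document share
dtm_pruned <- dtm[, keep, drop = FALSE]
```

Keep in mind that the token filter reintroduces exactly the per-word lookup that feature hashing is meant to avoid, so on a very large dataset you give up part of the speed benefit.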
