Manually inserting topic-specific stopwords

Question

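I'm using tidytext's built-in anti_join(get_stopwords()) step to clean documents drawn from customer reviews of technical products. The resulting corpus is dominated by technical specifications (e.g., Windows 10, 720p camera, 380.6 x 258.2 x 22.45 (inches), IntelCore, etc.), with only a few adjectives and nouns that indicate how satisfied customers are with the product.

Is there a convenient way to compile a list of technical terms to remove (such as those listed above) and manually add it to get_stopwords() or an equivalent function, so that the non-technical adjectives and nouns in the customer reviews are easier to identify?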

Answer 1

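You can create your own data frame of stop words. This example uses an H.G. Wells novel and two user-specified stop words (credit to https://www.tidytextmining.com/tidytext.html). I don't know whether a reputable corpus of tech-related stop words exists.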

library(dplyr)
library(tidytext)
library(gutenbergr)

hgwells <- gutenberg_download(35)  # The Time Machine
my_stop_words <- data.frame(word = c("time", "machine"))  # list of your stop words
hgwells %>% unnest_tokens(word, text) %>%
  anti_join(my_stop_words, by = "word")  # removes the words 'time' and 'machine'
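
To keep the default stop words and add the technical terms on top, one option is to append the custom words to the output of get_stopwords() before the anti_join. The following is only a sketch: the reviews data frame and the tech_stop_words list are made up for illustration, and the digit filter at the end is an extra assumption for dropping tokens such as "10" or "720p". Note that unnest_tokens() splits text into single lowercase words by default, so a multi-word term like "Windows 10" has to be listed word by word.

```r
library(dplyr)
library(tidytext)

# Hypothetical review data, invented for this example
reviews <- tibble(
  id = 1:2,
  text = c(
    "Great laptop with Windows 10 and a 720p camera",
    "The IntelCore processor is fast but the screen is disappointing"
  )
)

# Hypothetical topic-specific stop words (lowercase, one token each)
tech_stop_words <- tibble(
  word = c("windows", "camera", "intelcore", "processor", "laptop", "screen"),
  lexicon = "custom"
)

# Append the custom words to the default stop word list
all_stop_words <- bind_rows(get_stopwords(), tech_stop_words)

cleaned <- reviews %>%
  unnest_tokens(word, text) %>%              # one lowercase token per row
  anti_join(all_stop_words, by = "word") %>% # drop default and custom stop words
  filter(!grepl("[0-9]", word))              # assumption: drop tokens containing digits

cleaned
```

get_stopwords() relies on the stopwords package being installed. The combined all_stop_words data frame can be saved and reused, and the custom list extended as new technical terms turn up in the output.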


text-mining

stop-words

dplyr

tidytext