R - Remove length-one strings and stop words from a character column (using tidytext)

Question

Suppose I have this data frame df:

   Class sentence
1   Yes  there is p beaker on the table
2   Yes  they t the frown
3   Yes  so Z it was asleep

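How can I remove the length-one strings in the "sentence" column, to get rid of things like "t", "p" and "Z", and then do a final cleanup with the stop_words list from tidytext to get the following?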

   Class sentence
1   Yes  beaker table
2   Yes  frown
3   Yes  asleep

Answer 1

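If we want to use tidytext, create a sequence column (row_number()) so each original row can be identified, then apply unnest_tokens to the sentence column to get one lower-cased word per row. Do an anti_join against the default list returned by get_stopwords() (the Snowball English list), filter out words that are only 1 character long, and finally group by the row id and paste the 'word' column back together to rebuild 'sentence'.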

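library(dplyr)
library(tidytext)
library(stringr)
df %>% 
   # keep a row id so the tokens can be regrouped later
   mutate(rn = row_number()) %>%
   # one lower-cased word per row
   unnest_tokens(word, sentence) %>% 
   # drop stop words; `by` is explicit to avoid the join message
   anti_join(get_stopwords(), by = "word") %>% 
   # drop length-one tokens such as "p", "t", "z"
   filter(nchar(word) > 1) %>%
   group_by(rn, Class) %>%
   summarise(sentence = str_c(word, collapse = ' '), .groups = 'drop') %>% 
   select(-rn)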

Output

# A tibble: 3 x 2
  Class sentence    
  <chr> <chr>       
1 Yes   beaker table
2 Yes   frown       
3 Yes   asleep
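
Since the question asks for the stop_words data set specifically, here is a minimal sketch of the same pipeline using it instead of get_stopwords(). stop_words ships with tidytext and combines several lexicons, so it removes more words than the Snowball list alone. The names cleaned and result are only illustrative. The join back to df at the end is an addition not in the original answer: it assumes you want to keep rows whose tokens were all removed, which group_by/summarise would otherwise drop.

```r
library(dplyr)
library(tidytext)

# Same example data as in the question
df <- data.frame(
  Class = c("Yes", "Yes", "Yes"),
  sentence = c("there is p beaker on the table",
               "they t the frown",
               "so Z it was asleep"),
  stringsAsFactors = FALSE
)

cleaned <- df %>%
  mutate(rn = row_number()) %>%
  unnest_tokens(word, sentence) %>%        # one lower-cased token per row
  filter(nchar(word) > 1) %>%              # drop length-one tokens
  anti_join(stop_words, by = "word") %>%   # drop tidytext stop words
  group_by(rn, Class) %>%
  summarise(sentence = paste(word, collapse = " "), .groups = "drop")

# Rows whose tokens were all removed are missing from `cleaned`;
# join back to the original row ids to keep them with an empty sentence
result <- df %>%
  mutate(rn = row_number()) %>%
  select(rn, Class) %>%
  left_join(cleaned, by = c("rn", "Class")) %>%
  mutate(sentence = coalesce(sentence, "")) %>%
  select(-rn)

result
```

If empty rows should simply disappear, cleaned (minus the rn column) is already the answer and the second step can be skipped.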

Data

df <- structure(list(Class = c("Yes", "Yes", "Yes"), sentence = c("there is p beaker on the table", 
"they t the frown", "so Z it was asleep")), 
class = "data.frame", row.names = c("1", 
"2", "3"))

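If only the first step is needed (removing length-one strings without tokenizing), a base R regex is one option. This is a sketch under the assumption that a "length-one string" means a single letter or digit standing alone between word boundaries; the vector names sentences and no_singles are illustrative. Note that it would also strip the "t" from a contraction like "don't", because the apostrophe counts as a word boundary.

```r
sentences <- c("there is p beaker on the table",
               "they t the frown",
               "so Z it was asleep")

# Drop any single letter or digit that stands alone between word boundaries
no_singles <- gsub("\\b[[:alnum:]]\\b", "", sentences, perl = TRUE)

# Collapse the leftover runs of whitespace and trim both ends
no_singles <- trimws(gsub("\\s+", " ", no_singles))

no_singles
# [1] "there is beaker on the table" "they the frown" "so it was asleep"
```

Unlike unnest_tokens, this keeps the original letter case of the remaining words, and stop words still have to be removed in a separate step.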