Replacing words with spaces within a tibble in R without anti-join

I have a bunch of sentences like this:

# A tibble: 1,782 x 1

Chat
<chr>                                                                                                                                                                    
1 Hi i would like to find out more about the trials
2 Hello I had a guest 
3 Hello my friend overseas right now
...

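What I want to do is remove stop words such as "I" and "hello". I already have a list of them, and I want to replace each of these stop words with a space. I tried using mutate and gsub, but gsub only accepts a regular expression. An anti-join does not work here because I am building bigrams/trigrams, so I do not have a single-word column to anti-join the stop words against.

Is there a way to replace all of these words in every sentence in R?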

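We can unnest the tokens, replace every 'word' that is found in the 'word' column of 'stop_words' with a space (" "), and paste the 'word' values back together after grouping by 'lines'. Note that unnest_tokens lowercases the text and strips punctuation by default.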
library(tidytext)
library(tidyverse)
rowid_to_column(df1, 'lines') %>% 
     unnest_tokens(word, Chat) %>% 
     mutate(word = replace(word, word %in% stop_words$word, " ")) %>% 
     group_by(lines) %>% 
     summarise(Chat = paste(word, collapse=' ')) %>%
     ungroup %>%
     select(-lines)

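Note: this replaces the stop words found in the 'stop_words' dataset with " ". If we only need to replace a custom subset of stop words, create a vector of those stop words and change the mutate step accordingly.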
# unnest_tokens() lowercases tokens by default, so lowercase the custom list too
v1 <- tolower(c("I", "hello", "Hi"))
rowid_to_column(df1, 'lines') %>%
  ...
  ...
  mutate(word = replace(word, word %in% v1, " ")) %>%
  ...
  ...
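
For reference, the custom-list variant might be spelled out in full as in the sketch below. The three-row `df1` is a hypothetical stand-in for the asker's tibble, and `out` is just an illustrative name. The stop words are lowercased on the assumption that `unnest_tokens()` is left at its default of lowercasing the tokens.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical sample data standing in for the original 1,782-row tibble
df1 <- tibble(Chat = c("Hi i would like to find out more about the trials",
                       "Hello I had a guest",
                       "Hello my friend overseas right now"))

# Custom stop words, lowercased to match the lowercased tokens
v1 <- tolower(c("I", "hello", "Hi"))

out <- df1 %>%
  rowid_to_column("lines") %>%                        # remember which sentence each token came from
  unnest_tokens(word, Chat) %>%                       # one (lowercased) token per row
  mutate(word = replace(word, word %in% v1, " ")) %>% # stop words become a single space
  group_by(lines) %>%
  summarise(Chat = paste(word, collapse = " ")) %>%   # rebuild one string per sentence
  select(-lines)

out
```

Because each replaced token is pasted back with a separator, a removed word leaves a run of spaces rather than exactly one; wrapping the result in `trimws()` or collapsing whitespace is optional, depending on what the later n-gram step expects.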

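We can construct a pattern of the form \bstopword\b (written as "\\bstopword\\b" in an R string) and then use gsub to replace every match with " ". Here is an example. Note that I set ignore.case = TRUE so that both lower and upper case are matched, but you may need to adjust this to your needs.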

dat <- read.table(text = "Chat
                  1 'Hi i would like to find out more about the trials'
                  2 'Hello I had a guest' 
                  3 'Hello my friend overseas right now'",
                  header = TRUE, stringsAsFactors = FALSE)

dat
#                                                Chat
# 1 Hi i would like to find out more about the trials
# 2                               Hello I had a guest
# 3                Hello my friend overseas right now

# A list of stop words
stopword <- c("I", "Hello", "Hi")
# Create the pattern; "\\b" is the regex word boundary
# (a single "\b" inside an R string would be a backspace character)
stopword2 <- paste0("\\b", stopword, "\\b")
stopword3 <- paste(stopword2, collapse = "|")

# View the pattern
stopword3
# [1] "\\bI\\b|\\bHello\\b|\\bHi\\b"

dat$Chat <- gsub(pattern = stopword3, replacement = " ", x = dat$Chat, ignore.case = TRUE)
dat
#                                               Chat
# 1     would like to find out more about the trials
# 2                                      had a guest
# 3                     my friend overseas right now
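
Since the question mentions mutate, the same pattern can also be applied inside a dplyr pipeline. The following is a minimal sketch, assuming a small hypothetical tibble `df1` with a `Chat` column; the `Chat_clean` column, which collapses the leftover runs of spaces, is an optional extra and not part of the original answer.

```r
library(dplyr)
library(tibble)

# Hypothetical sample data standing in for the original tibble
df1 <- tibble(Chat = c("Hi i would like to find out more about the trials",
                       "Hello I had a guest",
                       "Hello my friend overseas right now"))

stopword <- c("I", "Hello", "Hi")

# One alternation wrapped in word boundaries: \b(I|Hello|Hi)\b
pattern <- paste0("\\b(", paste(stopword, collapse = "|"), ")\\b")

out <- df1 %>%
  mutate(
    # Replace each stop word with a single space, ignoring case
    Chat = gsub(pattern, " ", Chat, ignore.case = TRUE),
    # Optional: squeeze repeated whitespace and trim the ends
    Chat_clean = trimws(gsub("\\s+", " ", Chat))
  )

out
```

This works for plain words; stop words containing regex metacharacters (for example a period) would need to be escaped before being pasted into the pattern.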