Replacing words with spaces within a tibble in R without anti-join
I have a bunch of sentences like this:
# A tibble: 1,782 x 1
Chat
<chr>
1 Hi i would like to find out more about the trials
2 Hello I had a guest
3 Hello my friend overseas right now
...
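What I would like to do is remove stop words such as "I" and "hello". I already have a list of them, and I want to replace each stop word with a space. I tried using mutate and gsub, but gsub only accepts a regular expression. An anti-join does not work here because I am building bigrams/trigrams, so I do not have a one-word-per-row column to anti-join the stop words against.
Is there a way to replace all of these words in each sentence in R?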
We can unnest the tokens, replace each 'word' that is found in the 'word' column of 'stop_words' with a space (" "), and paste the 'word' values back together after grouping by 'lines'.
library(tidytext)
library(tidyverse)
rowid_to_column(df1, 'lines') %>%                # keep track of the original row
  unnest_tokens(word, Chat) %>%                  # one lowercased token per row
  mutate(word = replace(word, word %in% stop_words$word, " ")) %>%
  group_by(lines) %>%
  summarise(Chat = paste(word, collapse = ' ')) %>%   # rebuild each sentence
  ungroup() %>%
  select(-lines)
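Note: this replaces the stop words from the 'stop_words' dataset with " ".
If we only need to replace a custom subset of stop words, create a vector of those elements and change the mutate step accordingly.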
v1 <- c("I", "hello", "Hi")
rowid_to_column(df1, 'lines') %>%
...
...
# unnest_tokens() lowercases tokens by default, so compare against tolower(v1)
mutate(word = replace(word, word %in% tolower(v1), " ")) %>%
...
...
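For reference, here is one way the custom-subset pipeline might look when written out in full. This is only a sketch: `df1` is rebuilt here from the three sample rows in the question, `res` is a name introduced for illustration, and the steps are assumed to mirror the `stop_words` version above.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Sample data rebuilt from the question (assumption: the real df1 has 1,782 rows)
df1 <- tibble(Chat = c("Hi i would like to find out more about the trials",
                       "Hello I had a guest",
                       "Hello my friend overseas right now"))

v1 <- c("I", "hello", "Hi")

res <- df1 %>%
  rowid_to_column("lines") %>%
  unnest_tokens(word, Chat) %>%                              # tokens come back lowercased
  mutate(word = replace(word, word %in% tolower(v1), " ")) %>%
  group_by(lines) %>%
  summarise(Chat = paste(word, collapse = " ")) %>%
  select(-lines)

res
```

Because every token is lowercased by `unnest_tokens()`, the rebuilt sentences are lowercase too. If the original case matters, the regex approach does not tokenize at all and leaves the rest of the text untouched.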
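We can construct a pattern of the form "\\bstopword\\b" and then use gsub to replace the matches with " ". Here is an example. Note that I set ignore.case = TRUE so that both lowercase and uppercase are matched, but you may need to adjust this to your needs.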
dat <- read.table(text = "Chat
1 'Hi i would like to find out more about the trials'
2 'Hello I had a guest'
3 'Hello my friend overseas right now'",
header = TRUE, stringsAsFactors = FALSE)
dat
# Chat
# 1 Hi i would like to find out more about the trials
# 2 Hello I had a guest
# 3 Hello my friend overseas right now
# A list of stop words
stopword <- c("I", "Hello", "Hi")
# Create the pattern; the backslash must be doubled inside an R string
stopword2 <- paste0("\\b", stopword, "\\b")
stopword3 <- paste(stopword2, collapse = "|")
# View the pattern
stopword3
# [1] "\\bI\\b|\\bHello\\b|\\bHi\\b"
dat$Chat <- gsub(pattern = stopword3, replacement = " ", x = dat$Chat, ignore.case = TRUE)
dat
# Chat
# 1 would like to find out more about the trials
# 2 had a guest
# 3 my friend overseas right now
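The same idea can be wrapped in a small helper so it can be used inside `mutate()` on a tibble column. This is a sketch only: `replace_stopwords` is a hypothetical name, and it assumes the stop words are plain alphanumeric words with no regex metacharacters that would need escaping.

```r
# Hypothetical helper: replace whole-word matches of any stop word with `replacement`
replace_stopwords <- function(x, stopwords, replacement = " ") {
  # Build one alternation wrapped in word boundaries, e.g. \b(?:I|Hello|Hi)\b
  pattern <- paste0("\\b(?:", paste(stopwords, collapse = "|"), ")\\b")
  gsub(pattern, replacement, x, ignore.case = TRUE, perl = TRUE)
}

chat <- c("Hello I had a guest", "this high hill")
out <- replace_stopwords(chat, c("I", "Hello", "Hi"))
out

# Optional: collapse the leftover runs of spaces and trim the ends
tidy <- trimws(gsub(" +", " ", out))
tidy
```

The word boundaries keep "Hi" from matching inside "this" or "high". Whether to collapse the leftover spaces depends on the goal: keeping a placeholder where a stop word was makes it easier to avoid forming bigrams across a removed word later on.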