Replace words from a list of words
I have this data frame:
df <- structure(list(ID = 1:3, Text = c("there was not clostridium", "clostridium difficile positive", "test was OK but there was clostridium")), class = "data.frame", row.names = c(NA, -3L))
ID Text
1 1 there was not clostridium
2 2 clostridium difficile positive
3 3 test was OK but there was clostridium
and a pattern of stop words:
stop <- paste0(c("was", "but", "there"), collapse = "|")
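I would like to go through the text for each ID and replace the words that match the stop pattern.
It is important to keep the order of the words. I do not want to use merge functions.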
I have tried this:
df$Words <- tokenizers::tokenize_words(df$Text, lowercase = TRUE) ##I would like to make a list of single words
for (i in length(df$Words)){
df$clean <- lapply(df$Words, function(y) lapply(1:length(df$Words[i]),
function(x) stringr::str_replace(unlist(y) == x, stop, "REPLACED")))
}
But this gives me vectors of logical values as strings instead of a list of words.
> df
ID Text Words clean
1 1 there was not clostridium there, was, not, clostridium FALSE, FALSE, FALSE, FALSE
2 2 clostridium difficile positive clostridium, difficile, positive FALSE, FALSE, FALSE
3 3 test was OK but there was clostridium test, was, ok, but, there, was, clostridium FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
I would like to get this (all words from the stop pattern replaced, with the word order kept):
> df
ID Text Words clean
1 1 there was not clostridium there, was, not, clostridium "REPLACED", "REPLACED", not, clostridium
2 2 clostridium difficile positive clostridium, difficile, positive clostridium, difficile, positive
3 3 test was OK but there was clostridium test, was, ok, but, there, was, clostridium test, "REPLACED", OK, "REPLACED", "REPLACED", "REPLACED", clostridium
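A minimal sketch of where the FALSE values in the attempt come from, assuming only base R. Here `sub` stands in for `stringr::str_replace`, and `words` and `stop_pattern` are hypothetical stand-ins for one element of `df$Words` and for `stop`. The expression `unlist(y) == x` compares every word with a number, so the pattern ends up being applied to logical values turned into strings rather than to the words themselves.

```r
# Hypothetical stand-in for one element of df$Words
words <- c("there", "was", "not", "clostridium")
stop_pattern <- paste0(c("was", "but", "there"), collapse = "|")

# Comparing words with a number yields a logical vector, not words
compared <- words == 1

# The logicals become the strings "FALSE" (made explicit with as.character),
# and "FALSE" contains no stop word, so nothing is replaced
from_logical <- sub(stop_pattern, "REPLACED", as.character(compared))

# Applying the pattern to the words themselves gives the intended result
from_words <- sub(stop_pattern, "REPLACED", words)

print(from_logical)
print(from_words)
```

Note also that `for (i in length(df$Words))` runs only once, with `i` equal to the number of rows, so the loop adds nothing to the `lapply` call.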
You can use data.table:
library(data.table)
df = as.data.table(df)[, clean := lapply(Words, function(x) gsub(stop, "REPLACED", x))]
Or you can split the text directly, without creating the Words column (this line needs only base R):
df$clean = lapply(strsplit(df$Text, " "), function(x) gsub(stop, "REPLACED", x))
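Because each element is already a single token, a plain `gsub(stop, ...)` also rewrites part of a longer word that merely contains a stop word (for example, `wasp` would become `REPLACEDp`). A hedged base R sketch that replaces only whole tokens by anchoring the pattern with `^` and `$` follows. The names `stop_words`, `whole_word` and `replace_stop_words` are hypothetical, and the `wasp` sentence is an added test case.

```r
# Hypothetical helper: replace tokens that are exactly a stop word
stop_words <- c("was", "but", "there")

# ^(...)$ forces the whole token to match, so "wasp" is left alone
whole_word <- paste0("^(", paste(stop_words, collapse = "|"), ")$")

replace_stop_words <- function(text, replacement = "REPLACED") {
  # Split each string on single spaces, then replace matching tokens in place
  lapply(strsplit(text, " ", fixed = TRUE), function(tokens) {
    sub(whole_word, replacement, tokens)
  })
}

text <- c("there was not clostridium", "I was bit by a wasp")
clean <- replace_stop_words(text)
print(clean)
```

Under these assumptions, `df$clean <- replace_stop_words(df$Text)` would store the result as a list column with the word order preserved.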
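Tidyverse solution:
First, you need to modify the stop vector so that each stop word has \b before and after it. \b is a word boundary, which avoids accidentally removing the pattern from inside a longer word.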
library(stringr)
library(dplyr)
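# Backslashes must be doubled in an R string; "\b" on its own is a backspace
stop <- paste0(c("\\bwas\\b", "\\bbut\\b", "\\bthere\\b"), collapse = "|")
Then remove the matches with str_remove_all.
However, this leaves extra spaces behind. str_replace_all collapses each run of whitespace into a single space, and str_trim drops the space left at the start when the first word is removed.
df %>% mutate(Words = str_remove_all(Text, stop)) %>%
mutate(Words = str_trim(str_replace_all(Words, "\\s+", " ")))
This produces the following result (a row "I was bit by a wasp" was added to check that "wasp" is left intact):
# A tibble: 4 x 3
ID Text Words
<int> <chr> <chr>
1 1 there was not clostridium not clostridium
2 2 clostridium difficile positive clostridium difficile positive
3 3 test was OK but there was clostridium test OK clostridium
4 4 I was bit by a wasp I bit by a wasp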
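The effect of the word boundary can be checked in isolation. A small sketch in base R follows; using `gsub` with `perl = TRUE` and `trimws` in place of stringr is an assumption made to keep the example dependency-free, and `sentence`, `plain`, `bounded` and `bounded_clean` are hypothetical names. The backslash is doubled inside the R string so that the regex engine receives `\b`.

```r
sentence <- "I was bit by a wasp"

# Without boundaries the pattern also matches inside "wasp"
plain <- gsub("was", "", sentence)

# "\\b" in R source is the two-character regex escape \b (a word boundary)
bounded <- gsub("\\bwas\\b", "", sentence, perl = TRUE)

# Collapse the leftover run of spaces and trim both ends
bounded_clean <- trimws(gsub("\\s+", " ", bounded, perl = TRUE))

print(plain)
print(bounded_clean)
```

The unbounded pattern damages `wasp`, while the bounded one removes only the standalone word.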