R - delete length-one strings and stop words (using tidytext) in a character column
If I have df:
  Class                       sentence
1   Yes there is p beaker on the table
2   Yes               they t the frown
3   Yes             so Z it was asleep
How can I remove the length-one strings in the 'sentence' column, to get rid of things like 't', 'p' and 'Z', and then do a final clean-up with the stop_words list from tidytext to get the following?
  Class     sentence
1   Yes beaker table
2   Yes        frown
3   Yes       asleep
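If we want to use tidytext, create a sequence column with row_number(), apply unnest_tokens on the 'sentence' column, do an anti_join with the default data from get_stopwords(), filter out the words that have only 1 character, and then group by the sequence column and paste the 'word' column back together to create 'sentence'.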
library(dplyr)
library(tidytext)
library(stringr)
df %>%
  mutate(rn = row_number()) %>%                 # row id used to regroup the tokens later
  unnest_tokens(word, sentence) %>%             # one lowercased token per row
  anti_join(get_stopwords(), by = "word") %>%   # drop stop words
  filter(nchar(word) > 1) %>%                   # drop single-character tokens
  group_by(rn, Class) %>%
  summarise(sentence = str_c(word, collapse = ' '), .groups = 'drop') %>%
  select(-rn)
-output
# A tibble: 3 x 2
  Class sentence
  <chr> <chr>
1 Yes   beaker table
2 Yes   frown
3 Yes   asleep
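One thing to watch with this pipeline: a row whose sentence consists only of stop words and single characters has no tokens left after the filters, so it drops out of the result entirely. The following is a minimal sketch of one way to keep such rows, not the only approach. `clean_sentences()` and `df2` are made-up names for illustration and are not part of tidytext, and the sketch assumes the stopwords package that get_stopwords() relies on is installed.

```r
library(dplyr)
library(tidytext)
library(stringr)

# clean_sentences() is a hypothetical helper, not part of tidytext.
# `stops` can be any data frame with a `word` column, e.g. get_stopwords()
# or the stop_words dataset that ships with tidytext.
clean_sentences <- function(data, stops = get_stopwords()) {
  indexed <- data %>% mutate(rn = row_number())

  cleaned <- indexed %>%
    unnest_tokens(word, sentence) %>%          # one lowercased token per row
    anti_join(stops, by = "word") %>%          # drop stop words
    filter(nchar(word) > 1) %>%                # drop single-character tokens
    group_by(rn) %>%
    summarise(sentence = str_c(word, collapse = " "), .groups = "drop")

  indexed %>%
    select(-sentence) %>%
    left_join(cleaned, by = "rn") %>%          # rows with no surviving words get NA
    mutate(sentence = coalesce(sentence, "")) %>%
    select(-rn)
}

# Made-up input: the second sentence has only stop words and a single letter
df2 <- data.frame(Class = c("Yes", "No"),
                  sentence = c("there is p beaker on the table", "it was t"),
                  stringsAsFactors = FALSE)

result <- clean_sentences(df2)
print(result)
```

Here the second row is kept with an empty sentence instead of disappearing. Calling `clean_sentences(df2, stop_words)` uses the larger stop_words dataset mentioned in the question; it may remove more words than the get_stopwords() default.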
data
df <- structure(list(Class = c("Yes", "Yes", "Yes"),
    sentence = c("there is p beaker on the table",
        "they t the frown", "so Z it was asleep")),
    class = "data.frame", row.names = c("1", "2", "3"))