R tidytext stop_words are not filtering consistently from gutenbergr downloads
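This is an odd puzzle. I downloaded two texts with gutenbergr: Alice's Adventures in Wonderland and Ulysses.
The stop words disappear from Alice, but they are still present in Ulysses.
The problem persists even when I replace anti_join with filter(!word %in% stop_words$word).
How do I get the stop words out of Ulysses?
Thanks for your help!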
Plot of top 15 tf_idf for Alice & Ulysses
library(gutenbergr)
library(dplyr)
library(stringr)
library(tidytext)
library(ggplot2)

# Download both books along with their title and author metadata
titles <- c("Alice's Adventures in Wonderland", "Ulysses")
books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = c("title", "author"))

data(stop_words)

# One row per word, stop words removed, then counted per book
tidy_books <- books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE) %>%
  ungroup()

# Compute tf-idf per word within each title
plot_tidy_books <- tidy_books %>%
  bind_tf_idf(word, title, n) %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  mutate(title = factor(title, levels = unique(title)))

# Plot the top 15 words by tf-idf for each book
plot_tidy_books %>%
  group_by(title) %>%
  arrange(desc(n)) %>%
  top_n(15, tf_idf) %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = title)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~title, ncol = 2, scales = "free") +
  coord_flip()
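After some digging in the tokenized Ulysses, it turns out that the text "it's" actually uses a right single quotation mark rather than an apostrophe. stop_words in tidytext uses the apostrophe, so the two never match. You have to replace the right single quotation mark with an apostrophe.
I found this out with: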
> utf8ToInt('it’s')
[1] 105 116 8217 115
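The same check can be run over a whole vector of tokens to see which ones carry the curly quote. This is only a minimal sketch in base R: the `tokens` vector is a made-up stand-in for the `word` column produced by unnest_tokens(), not data taken from the books.

```r
# Hypothetical tokens standing in for the `word` column after unnest_tokens()
tokens <- c("it\u2019s", "it's", "don\u2019t", "alice")

# TRUE where a token contains code point 8217 (U+2019, right single quotation mark)
has_curly <- vapply(
  tokens,
  function(w) 8217L %in% utf8ToInt(w),
  logical(1),
  USE.NAMES = FALSE
)

# Tokens that would fail to match apostrophe-based stop words
curly_tokens <- tokens[has_curly]
print(curly_tokens)
```

On real data, something like `tokens <- unique(tidy_books$word)` would be the natural input, assuming `tidy_books` has been built as in the question.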
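Googling 8217 showed that it is the code point of the right single quotation mark, whose C++/Java source code escape is \u2019. From there, it is as simple as adding a mutate with a gsub statement before the anti_join: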
tidy_books <- books %>%
  unnest_tokens(word, text) %>%
  mutate(word = gsub("\u2019", "'", word)) %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE) %>%
  ungroup()
Result:
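The effect of the gsub step can also be checked without downloading anything. The sketch below uses base R only. Both `stop_list` and `tokens` are small hypothetical stand-ins (for stop_words$word and for the tokenized text), chosen just to show the mismatch and the fix.

```r
# Hypothetical stand-in for stop_words$word (apostrophe-based entries)
stop_list <- c("it's", "don't", "the")

# Hypothetical tokens as they might come out of Ulysses, with curly quotes
tokens <- c("it\u2019s", "don\u2019t", "the", "bloom")

# Without normalization, the curly-quote contractions survive the filter
before <- tokens[!tokens %in% stop_list]

# Replace U+2019 with a plain apostrophe, then filter again
normalized <- gsub("\u2019", "'", tokens, fixed = TRUE)
after <- normalized[!normalized %in% stop_list]

print(before)
print(after)
```

Here `before` still contains the two contractions, while `after` keeps only the non-stop word. The same %in% filter stands in for anti_join, which matches on exact string equality in the same way.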