TidyText Clustering
I want to use R and the tidytext package to cluster similar words. I have created my tokens and now want to convert them into a matrix so that I can cluster them. I'd like to try a few tokenization techniques and see which one gives the most compact clusters.
My code is below (taken from the widyr package documentation). I just can't work out the next step. Can anyone help?
library(janeaustenr)
library(dplyr)
library(tidytext)
library(widyr)   # pairwise_similarity() comes from widyr

# Comparing Jane Austen novels
austen_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word)   # word counts per book, giving the n column used below

# closest books to each other
closest <- austen_words %>%
  pairwise_similarity(book, word, n) %>%
  arrange(desc(similarity))
I know how to build a clustering algorithm around closest.
This code would get me there, but I don't know how to get from the previous step to a matrix.
d <- dist(m)   # m would be the document-term matrix I don't know how to build
kfit <- kmeans(d, 4, nstart = 100)
You can create an appropriate matrix for this by casting from tidytext. There are several cast_ functions, such as cast_sparse().
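For reference, a minimal sketch of the other casting functions, assuming a hypothetical tidy data frame word_counts with columns document, word, and n (cast_dtm() needs the tm package and cast_dfm() the quanteda package):

word_counts %>% cast_dtm(document, word, n)     # tm DocumentTermMatrix
word_counts %>% cast_dfm(document, word, n)     # quanteda dfm
word_counts %>% cast_sparse(document, word, n)  # sparse Matrix (dgCMatrix)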
Let's use four example books and cluster the chapters within each book:
library(tidyverse)
library(tidytext)
library(gutenbergr)
my_mirror <- "http://mirrors.xmission.com/gutenberg/"
books <- gutenberg_download(c(36, 158, 164, 345),
                            meta_fields = "title",
                            mirror = my_mirror)
books %>%
  count(title)
#> # A tibble: 4 x 2
#>   title                                     n
#> * <chr>                                 <int>
#> 1 Dracula                               15568
#> 2 Emma                                  16235
#> 3 The War of the Worlds                  6474
#> 4 Twenty Thousand Leagues under the Sea 12135
# break apart the chapters
by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)
glimpse(by_chapter)
#> Rows: 50,315
#> Columns: 3
#> $ gutenberg_id <int> 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, …
#> $ text <chr> "CHAPTER ONE", "", "THE EVE OF THE WAR", "", "", "No one…
#> $ document <chr> "The War of the Worlds_1", "The War of the Worlds_1", "T…
words_sparse <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(document, word, sort = TRUE) %>%
  cast_sparse(document, word, n)
#> Joining, by = "word"
class(words_sparse)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
dim(words_sparse)
#> [1] 182 18124
The words_sparse object is a sparse matrix, created via cast_sparse(). You can read more about converting back and forth between tidy and non-tidy formats for text in this chapter.
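As a minimal sketch of that round trip (this assumes the tm package is installed; chapter_counts and chapters_dtm are just hypothetical intermediate names):

chapter_counts <- by_chapter %>%
  unnest_tokens(word, text) %>%
  count(document, word)

# cast the tidy counts to a tm DocumentTermMatrix, then tidy() it back
# into one row per document-word pair
chapters_dtm <- chapter_counts %>%
  cast_dtm(document, word, n)
tidy(chapters_dtm)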
Now that you have a matrix of word counts (i.e., a document-term matrix; you might consider weighting by tf-idf instead of raw counts, as sketched at the end of this answer), you can use kmeans(). How many chapters from each book cluster together?
kfit <- kmeans(words_sparse, centers = 4)
enframe(kfit$cluster, value = "cluster") %>%
  separate(name, into = c("title", "chapter"), sep = "_") %>%
  count(title, cluster) %>%
  arrange(cluster)
#> # A tibble: 8 x 3
#>   title                                 cluster     n
#>   <chr>                                   <int> <int>
#> 1 Dracula                                     1    26
#> 2 The War of the Worlds                       1     1
#> 3 Dracula                                     2    28
#> 4 Emma                                        2     9
#> 5 The War of the Worlds                       2    26
#> 6 Twenty Thousand Leagues under the Sea       2     9
#> 7 Twenty Thousand Leagues under the Sea       3    37
#> 8 Emma                                        4    46
Created on 2021-02-04 by the reprex package (v1.0.0)
One cluster is all Emma, one is all Twenty Thousand Leagues under the Sea, and one has chapters from all four books.
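If you want to try the tf-idf weighting mentioned above, here is a minimal sketch under the same setup (bind_tf_idf() is from tidytext; tot.withinss is the total within-cluster sum of squares that kmeans() returns, one rough way to compare how compact the clusterings are):

words_tfidf <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(document, word, sort = TRUE) %>%
  bind_tf_idf(word, document, n) %>%
  cast_sparse(document, word, tf_idf)

kfit_tfidf <- kmeans(words_tfidf, centers = 4)

# lower total within-cluster sum of squares means tighter clusters, but the
# two matrices are on different scales, so compare with care
kfit$tot.withinss
kfit_tfidf$tot.withinss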