Opposite of unnest_tokens
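This may well be a silly question, but I have googled and googled and can't find a solution. I think that is because I don't know the right way to word what I am searching for.
I have a data frame that I converted to tidy text format in R to get rid of stop words. I would now like to 'untidy' that data frame back to its original format.
What is the opposite / inverse command of unnest_tokens?
Edit: here is what the data I'm working with look like. I'm trying to replicate analyses from Silge and Robinson's Tidy Text book, but using Italian opera librettos.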
character = c("FIGARO", "SUSANNA", "CONTE", "CHERUBINO")
line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre", "Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.", "Susanna, mi sembri agitata e confusa.", "Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!")
sample_df <- data.frame(character, line, stringsAsFactors = FALSE)
sample_df
character line
FIGARO Cinque... dieci.... venti... trenta... trentasei...quarantatre
SUSANNA Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.
CONTE Susanna, mi sembri agitata e confusa.
CHERUBINO Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!
I turned it into tidy text so that I could get rid of the stop words:
library(dplyr)
library(tidytext)

tribble <- sample_df %>%
  unnest_tokens(word, line)
# Get rid of stop words
# I had to make my own list of stop words for 18th century Italian opera
# (mystopwords is not shown here; it is assumed to be a character vector)
itstopwords <- tibble(word = mystopwords)
tribble2 <- tribble %>%
  anti_join(itstopwords, by = "word")
Now I have something like this:
character word
FIGARO cinque
FIGARO dieci
FIGARO venti
FIGARO trenta
...
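I'd like to get it back into the format of character name and associated line, so I can look at other things. Basically I want the text in the same format as before, but with the stop words removed.
Not a silly question! The answer depends somewhat on exactly what you are trying to do, but here is my typical approach, using the group_by() function from dplyr, when I want to get my text back to its original form after some processing in its tidied form.
First, let's go from raw text to a tidied format.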
library(tidyverse)
library(tidytext)
tidy_austen <- janeaustenr::austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)
tidy_austen
#> # A tibble: 725,055 x 3
#> book linenumber word
#> <fct> <int> <chr>
#> 1 Sense & Sensibility 1 sense
#> 2 Sense & Sensibility 1 and
#> 3 Sense & Sensibility 1 sensibility
#> 4 Sense & Sensibility 3 by
#> 5 Sense & Sensibility 3 jane
#> 6 Sense & Sensibility 3 austen
#> 7 Sense & Sensibility 5 1811
#> 8 Sense & Sensibility 10 chapter
#> 9 Sense & Sensibility 10 1
#> 10 Sense & Sensibility 13 the
#> # … with 725,045 more rows
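Now the text is tidy! But we can untidy it, back to something resembling its original form. I typically approach this using group_by() and summarize() from dplyr, and str_c() from stringr. What does the text look like at the end, in this particular case?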
tidy_austen %>%
  group_by(book, linenumber) %>%
  summarize(text = str_c(word, collapse = " ")) %>%
  ungroup()
#> # A tibble: 62,272 x 3
#> book linenumber text
#> <fct> <int> <chr>
#> 1 Sense & Sensib… 1 sense and sensibility
#> 2 Sense & Sensib… 3 by jane austen
#> 3 Sense & Sensib… 5 1811
#> 4 Sense & Sensib… 10 chapter 1
#> 5 Sense & Sensib… 13 the family of dashwood had long been settled…
#> 6 Sense & Sensib… 14 was large and their residence was at norland…
#> 7 Sense & Sensib… 15 their property where for many generations th…
#> 8 Sense & Sensib… 16 respectable a manner as to engage the genera…
#> 9 Sense & Sensib… 17 surrounding acquaintance the late owner of t…
#> 10 Sense & Sensib… 18 man who lived to a very advanced age and who…
#> # … with 62,262 more rows
Created on 2019-07-11 by the reprex package (v0.3.0)
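To connect this back to the opera data in the question, here is a minimal sketch of the same group_by() / summarize() / str_c() approach applied to a shortened version of sample_df. The stop word list and the line_id column are assumptions made for illustration (the real mystopwords vector is not shown in the question), and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Shortened version of the sample data from the question
sample_df <- tibble(
  character = c("FIGARO", "CONTE"),
  line = c("Cinque... dieci.... venti... trenta...",
           "Susanna, mi sembri agitata e confusa.")
)

# Hypothetical stop word list, for illustration only
itstopwords <- tibble(word = c("mi", "e"))

untidied <- sample_df %>%
  # line_id (an assumed helper column) records which line each word came from,
  # so a character who speaks more than once keeps separate lines
  mutate(line_id = row_number()) %>%
  unnest_tokens(word, line) %>%
  anti_join(itstopwords, by = "word") %>%
  # collapse the remaining words back into one string per original line
  group_by(line_id, character) %>%
  summarize(line = str_c(word, collapse = " "), .groups = "drop")

untidied
```

Grouping by line_id as well as character keeps the lines in their original order; grouping by character alone would merge all of a character's lines into one and sort the result alphabetically by name. Note that the rebuilt text is lowercase and has no punctuation, because unnest_tokens() lowercases and strips punctuation by default, so this gives something resembling the original rather than an exact inverse.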
library(tidyverse)
tidy_austen %>%
  group_by(book, linenumber) %>%
  summarise(text = str_c(word, collapse = " "))