tidytext: Issue with unnest_tokens and token = 'ngrams'
I am running the following code:
library(rwhatsapp)
library(tidytext)
library(dplyr)  # provides %>% and as_tibble()
chat <- rwa_read(x = c(
  "31/1/15 04:10:59 - Menganito: Was it good?",
  "31/1/15 14:10:59 - Fulanito: Yes, it was"
))
chat %>% as_tibble() %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)
But I get the following error:
Error in unnest_tokens.data.frame(., output = bigram, input = text, token = "ngrams", :
If collapse = TRUE (such as for unnesting by sentence or paragraph), unnest_tokens needs all input columns to be atomic vectors (not lists)
I tried searching on Google but could not find an answer. The text column is a character vector, so I don't understand why I get an error message saying it is not.
The problem is that the data contain some list columns whose elements are NULL:
str(chat)
#tibble [2 × 6] (S3: tbl_df/tbl/data.frame)
# $ time : POSIXct[1:2], format: "2015-01-31 04:10:59" "2015-01-31 14:10:59"
# $ author : Factor w/ 2 levels "Fulanito","Menganito": 2 1
# $ text : chr [1:2] "Was it good?" "Yes, it was"
# $ source : chr [1:2] "text input" "text input"
# $ emoji :List of 2 ###
# ..$ : NULL
# ..$ : NULL
# $ emoji_name:List of 2 ###
# ..$ : NULL
# ..$ : NULL
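To see which columns are responsible without reading through the str() output, each column can be checked with is.list(). This is only a minimal sketch: chat_demo is a hypothetical hand-built stand-in with the same column types as chat, used so the snippet runs without rwhatsapp.

```r
library(tibble)

# Hypothetical stand-in for `chat`, mimicking the column types shown by str(chat)
chat_demo <- tibble(
  author     = factor(c("Menganito", "Fulanito")),
  text       = c("Was it good?", "Yes, it was"),
  emoji      = list(NULL, NULL),
  emoji_name = list(NULL, NULL)
)

# TRUE marks a list column; these are the columns unnest_tokens() complains about
is_list_col <- vapply(chat_demo, is.list, logical(1))
list_cols <- names(chat_demo)[is_list_col]
print(list_cols)
```

On the real data the same check would be vapply(chat, is.list, logical(1)).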
We can remove those columns, and then it works:
library(rwhatsapp)
library(tidytext)
library(dplyr)  # provides %>% and select_if()
chat %>%
  select_if(~ !is.list(.)) %>%  # keep only the atomic (non-list) columns
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)
# A tibble: 4 x 4
# time author source bigram
# <dttm> <fct> <chr> <chr>
#1 2015-01-31 04:10:59 Menganito text input was it
#2 2015-01-31 04:10:59 Menganito text input it good
#3 2015-01-31 14:10:59 Fulanito text input yes it
#4 2015-01-31 14:10:59 Fulanito text input it was
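In dplyr 1.0.0 and later, select_if() is superseded, and the same column filter can be written with select(where(...)). The sketch below assumes dplyr >= 1.0.0 and again uses a hypothetical stand-in tibble (chat_demo) in place of the real chat object, so it shows only the column-dropping step.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `chat`, with two atomic and two list columns
chat_demo <- tibble(
  author     = factor(c("Menganito", "Fulanito")),
  text       = c("Was it good?", "Yes, it was"),
  emoji      = list(NULL, NULL),
  emoji_name = list(NULL, NULL)
)

# Keep only the columns that are not lists
chat_atomic <- chat_demo %>%
  select(where(~ !is.list(.x)))

print(names(chat_atomic))
```

The result can then be piped into unnest_tokens() exactly as in the select_if() version.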
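Also, collapse = TRUE is the default here, which causes a problem when there are NULL elements, because the lengths differ once the elements are collapsed. One option is to specify collapse = FALSE: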
chat %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams",
                n = 2, collapse = FALSE)
# A tibble: 4 x 6
# time author source emoji emoji_name bigram
# <dttm> <fct> <chr> <list> <list> <chr>
#1 2015-01-31 04:10:59 Menganito text input <NULL> <NULL> was it
#2 2015-01-31 04:10:59 Menganito text input <NULL> <NULL> it good
#3 2015-01-31 14:10:59 Fulanito text input <NULL> <NULL> yes it
#4 2015-01-31 14:10:59 Fulanito text input <NULL> <NULL> it was