Using the Nested List Column Approach and purrr Together with tidytext::unnest_tokens
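I have a data frame of survey responses in which each row represents a different person. One column, "Text", holds the answers to an open-ended text question. I would like to use tidytext::unnest_tokens so that I can run text analysis on each row, including sentiment scores, word counts, and so on.
Here is a simple data frame for this example: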
Satisfaction <- c("Satisfied", "Satisfied", "Dissatisfied", "Satisfied", "Dissatisfied")
Text <- c("I'm very satisfied with the services", "Your service providers are always late which causes me a lot of frustration", "You should improve your staff training, service providers have bad customer service", "Everything is great!", "Service is bad")
Gender <- c("M", "M", "F", "M", "F")
df <- data.frame(Satisfaction, Text, Gender)
Then I converted the Text column to character:
df$Text <- as.character(df$Text)
Next I added an id column, grouped by it, tokenized the text, and nested the data frame:
df <- df %>% mutate(id = row_number()) %>% group_by(id) %>% unnest_tokens(word, Text) %>% nest(-id)
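This seems to work so far, but how do I now use the purrr::map functions to work with the nested list column "word"? For example, what if I want to use dplyr::mutate to create a new column holding the word count for each row?
Also, is there a better way to nest the data frame so that only the "Text" column becomes a nested list?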
I love using purrr::map to do modeling for different groups, but for what you're describing, I think you can stick with plain dplyr.
You can set up your data frame like this:
library(dplyr)
library(tidytext)
Satisfaction <- c("Satisfied",
                  "Satisfied",
                  "Dissatisfied",
                  "Satisfied",
                  "Dissatisfied")
Text <- c("I'm very satisfied with the services",
          "Your service providers are always late which causes me a lot of frustration",
          "You should improve your staff training, service providers have bad customer service",
          "Everything is great!",
          "Service is bad")
Gender <- c("M", "M", "F", "M", "F")
# tibble() replaces the deprecated data_frame() and keeps Text as character
df <- tibble(Satisfaction, Text, Gender)
# One row per word, with an id that identifies the original response
tidy_df <- df %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, Text)
Then, to find, for example, the number of words per row, you can use group_by and mutate.
tidy_df %>%
  group_by(id) %>%
  mutate(num_words = n()) %>%
  ungroup()
#> # A tibble: 37 × 5
#>    Satisfaction Gender    id word      num_words
#>    <chr>        <chr>  <int> <chr>         <int>
#>  1 Satisfied    M          1 i'm               6
#>  2 Satisfied    M          1 very              6
#>  3 Satisfied    M          1 satisfied         6
#>  4 Satisfied    M          1 with              6
#>  5 Satisfied    M          1 the               6
#>  6 Satisfied    M          1 services          6
#>  7 Satisfied    M          2 your             13
#>  8 Satisfied    M          2 service          13
#>  9 Satisfied    M          2 providers        13
#> 10 Satisfied    M          2 are              13
#> # ... with 27 more rows
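If you do want the list-column route from the question, here is a minimal sketch of one way it could look. It assumes tidyr 1.0 or later, where nest() accepts a `new_column = columns` pair; the names `words` and `nested_df` are my own and not from the question. Only the tokenized `word` column is nested, so each respondent stays on a single row, and purrr::map_int counts the rows of each nested tibble.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tidytext)

df <- tibble(
  Satisfaction = c("Satisfied", "Satisfied", "Dissatisfied",
                   "Satisfied", "Dissatisfied"),
  Text = c("I'm very satisfied with the services",
           "Your service providers are always late which causes me a lot of frustration",
           "You should improve your staff training, service providers have bad customer service",
           "Everything is great!",
           "Service is bad"),
  Gender = c("M", "M", "F", "M", "F")
)

nested_df <- df %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, Text) %>%
  # Nest only `word`; Satisfaction, Gender and id stay as ordinary columns.
  # `words` is a hypothetical name for the new list-column.
  nest(words = word) %>%
  # Each element of `words` is a one-column tibble, so nrow() is its word count
  mutate(num_words = map_int(words, nrow))
```

This gives one row per response with a `words` list-column and an integer `num_words` column. For a plain count, the group_by and mutate version above is simpler; the nested form mostly pays off when you want to map a more involved function over each response.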
You can do sentiment analysis by implementing an inner join; check out some examples here.
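As a rough sketch of what that inner join might look like on this data: the choice of the Bing lexicon via get_sentiments("bing") is my assumption, and the names `sentiment_by_id` and `n` are only illustrative. Words that do not appear in the lexicon are dropped by the join, so a response with no matching words will not show up in the result.

```r
library(dplyr)
library(tidytext)

df <- tibble(
  id = 1:5,
  Text = c("I'm very satisfied with the services",
           "Your service providers are always late which causes me a lot of frustration",
           "You should improve your staff training, service providers have bad customer service",
           "Everything is great!",
           "Service is bad")
)

# One row per word, keyed by the response id
tidy_df <- df %>%
  unnest_tokens(word, Text)

# Keep only words found in the lexicon, then count
# positive and negative words for each response
sentiment_by_id <- tidy_df %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(id, sentiment)
```

From here you could reshape the counts or join them back onto the original data frame by `id` to sit next to Satisfaction and Gender.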