Failed to get data in a single comma-separated row, grouped by another column's values

Question

I have a data frame with many variables. Two of them are shown in the sample dataset test, created with the code below:

test <- data.frame(row_numb = c(1,  1,  1,  1,  1,  1,  1,  2,  2,  2,  3,  3,  3,  3,  3,  3,  3,  3),
                   words = c('apply','assistance','benefit','compass','medical','online','renew','meet','service','website','center','country','country','develop','highly','home','major','obtain'))

I am trying to join the words in the words column into a column Dictionary in a new data frame fdata, grouped by row_numb and separated by commas, using the code below:

fdata <- test %>% 
    select(row_numb, words) %>% 
    group_by(row_numb) %>% 
    unite(Dictionary, words, sep=",")

I am unable to get the expected result:

 row_numb   Dictionary
 1          apply, assistance, benefit, compass, medical, online, renew
 2          meet, service.... and so forth

Can anyone help me find the mistake I am making?

Answer 1

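unite is for pasting multiple columns together, not for aggregating a single column. For that, use summarise with paste(..., collapse = ', '), or, for the specific case of a comma-separated string, toString: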

library(tidyverse)

test <- data.frame(row_numb = c(1,  1,  1,  1,  1,  1,  1,  2,  2,  2,  3,  3,  3,  3,  3,  3,  3,  3),
                   words = c('apply','assistance','benefit','compass','medical','online','renew','meet','service','website','center','country','country','develop','highly','home','major','obtain'))

test %>% group_by(row_numb) %>% summarise(words = toString(words))
#> # A tibble: 3 x 2
#>   row_numb words                                                         
#>      <dbl> <chr>                                                         
#> 1        1 apply, assistance, benefit, compass, medical, online, renew   
#> 2        2 meet, service, website                                        
#> 3        3 center, country, country, develop, highly, home, major, obtain
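The paste(..., collapse = ', ') variant mentioned above is not shown, so here is a minimal sketch of it. It assumes a shortened, hypothetical data frame small in place of test, and names the result column Dictionary as in the question; the separator can be any string.

```r
library(dplyr)

# Hypothetical, shortened stand-in for the `test` data frame
small <- data.frame(
  row_numb = c(1, 1, 1, 2, 2),
  words = c("apply", "assistance", "benefit", "meet", "service"),
  stringsAsFactors = FALSE
)

# One row per row_numb; paste() with `collapse` joins each group's
# words into a single string
fdata <- small %>%
  group_by(row_numb) %>%
  summarise(Dictionary = paste(words, collapse = ", "))

fdata$Dictionary
#> [1] "apply, assistance, benefit" "meet, service"
```

toString(words) is shorthand for exactly this call with ", " as the separator, so paste is mainly worth reaching for when a different separator is needed.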

To use unite, specify the name of the new column and the columns that should be pasted together, optionally with the sep argument, e.g.

iris %>% unite(sepal_l_w, Sepal.Length, Sepal.Width, sep = ' / ') %>% head()
#>   sepal_l_w Petal.Length Petal.Width Species
#> 1 5.1 / 3.5          1.4         0.2  setosa
#> 2   4.9 / 3          1.4         0.2  setosa
#> 3 4.7 / 3.2          1.3         0.2  setosa
#> 4 4.6 / 3.1          1.5         0.2  setosa
#> 5   5 / 3.6          1.4         0.2  setosa
#> 6 5.4 / 3.9          1.7         0.4  setosa

Answer 2

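Another general pattern that works well for this kind of task is nest() followed by mutate()/map(), for cases where no ready-made function like toString() fits the specific task you need to perform next. It is still only three lines of code: first nest() your data, then flatten the list structure, then paste/collapse it together.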

library(tidyverse)

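test %>%
  # with tidyr >= 1.0.0 this is written nest(data = -row_numb)
  nest(-row_numb) %>%
  # flatten each nested tibble into a vector
  mutate(Dictionary = map(data, unlist),
         # collapse each vector into one comma-separated string
         Dictionary = map_chr(Dictionary, paste, collapse = ", "))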

#> # A tibble: 3 x 3
#>   row_numb data           Dictionary                                      
#>      <dbl> <list>         <chr>                                           
#> 1        1 <tibble [7 × … apply, assistance, benefit, compass, medical, o…
#> 2        2 <tibble [3 × … meet, service, website                          
#> 3        3 <tibble [8 × … center, country, country, develop, highly, home…

Created on 2018-08-14 by the reprex package (v0.2.0).
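As a rough illustration of when the nest() pattern pays off, the sketch below applies a custom per-group step that toString() alone does not cover: dropping duplicates and sorting before collapsing. The data frame small and the helper collapse_unique are hypothetical names introduced only for this example, and group_by() followed by nest() is used here as an assumption-free way to get a data list column.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical, shortened stand-in for the `test` data frame
small <- data.frame(
  row_numb = c(1, 1, 2, 2, 2),
  words = c("benefit", "apply", "country", "center", "country"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: de-duplicate, sort, then collapse one group's words
collapse_unique <- function(df) {
  paste(sort(unique(df$words)), collapse = ", ")
}

result <- small %>%
  group_by(row_numb) %>%
  nest() %>%      # one row per group, remaining columns in the `data` list column
  ungroup() %>%
  mutate(Dictionary = map_chr(data, collapse_unique))

result$Dictionary
#> [1] "apply, benefit"  "center, country"
```

For a step this small, summarise() with the same helper would also work; the nested form is mostly useful when the per-group data is needed again in later steps.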
