Hashing every row of a tibble

Question

I am using the newly released dplyr 1.0.0 together with the digest package to generate a hash of every row in a tibble.

I know of earlier approaches to this, but I would like to use the revamped rowwise() in dplyr 1.0.0.

See the example below. Does anyone know why it fails? I should be allowed to digest a row whose entries have different types.

library(dplyr)
library(digest)

df <- tibble(
    student_id = letters[1:4],
    student_id2 = letters[9:12],
    test1 = 10:13, 
    test2 = 20:23, 
    test3 = 30:33, 
    test4 = 40:43
)

df
#> # A tibble: 4 x 6
#>   student_id student_id2 test1 test2 test3 test4
#>   <chr>      <chr>       <int> <int> <int> <int>
#> 1 a          i              10    20    30    40
#> 2 b          j              11    21    31    41
#> 3 c          k              12    22    32    42
#> 4 d          l              13    23    33    43

dd <- df %>%
    rowwise(student_id) %>%
    mutate(hash = digest(c_across(everything()))) %>%
    ungroup
#> Error: Problem with `mutate()` input `hash`.
#> ✖ Can't combine `student_id2` <character> and `test1` <integer>.
#> ℹ Input `hash` is `digest(c_across(everything()))`.
#> ℹ The error occured in row 1.

### but digest should not care too much about the type of the input

Created on 2020-06-04 by the reprex package (v0.3.0)

Answer 1

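The mixed column types are the problem: c_across() combines the selected columns with vctrs::vec_c(), which refuses to coerce character and integer to a common type. One option is to convert all columns to a single type first, and then apply rowwise()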

library(dplyr)
library(digest)
df %>%
    mutate(across(everything(), as.character)) %>% 
    rowwise %>%
    mutate(hash = digest(c_across(everything()))) 
# A tibble: 4 x 7
# Rowwise: 
#  student_id student_id2 test1 test2 test3 test4 hash                            
#  <chr>      <chr>       <chr> <chr> <chr> <chr> <chr>                           
#1 a          i           10    20    30    40    2638067de6dcfb3d58b83a83e0cd3089
#2 b          j           11    21    31    41    21162fc0c528a6550b53c87ca0c2805e
#3 c          k           12    22    32    42    8d7539eacff61efbd567b6100227523b
#4 d          l           13    23    33    43    9739997605aa39620ce50e96f1ff4f70
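
To illustrate why the original call fails, here is a minimal sketch comparing base c() with vctrs::vec_c(), the function c_across() relies on. It assumes the vctrs package is installed (it is a dependency of dplyr); the names base_result and vctrs_result are only for illustration.

```r
library(vctrs)

# Base c() silently coerces the integer to character.
base_result <- c("i", 10L)
base_result

# vec_c() has no common type for character and integer, so it raises an error.
# The error is caught here and replaced by a marker string.
vctrs_result <- tryCatch(
    vec_c("i", 10L),
    error = function(e) "error: no common type"
)
vctrs_result
```

This is why converting every column with as.character before c_across() avoids the error: all inputs then already share one type.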

Another option is to unite() the columns into a single column and then call digest() on that column

library(tidyr)
df %>% 
   unite(new, everything(), remove = FALSE) %>% 
   rowwise %>%
   mutate(hash = digest(new)) %>%
   select(-new)
# A tibble: 4 x 7
# Rowwise: 
#  student_id student_id2 test1 test2 test3 test4 hash                            
#  <chr>      <chr>       <int> <int> <int> <int> <chr>                           
#1 a          i              10    20    30    40 a9e4cafdfbc88f17b7593dfd684eb2a1
#2 b          j              11    21    31    41 a67a5df8186972285bd7be59e6fdab38
#3 c          k              12    22    32    42 9c20bd87a50642631278b3e6d28ecf68
#4 d          l              13    23    33    43 3f4f373d1969dcf0c8f542023a258225
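
Since the united column is an ordinary character vector, the rowwise() step is arguably optional. The following is a sketch of the same idea that maps digest() over the column with base vapply(); it rebuilds the df from the question, and the name out is only for illustration. Each call receives one length-one string, so the result should match the rowwise version, but that has not been verified here.

```r
library(dplyr)
library(tidyr)
library(digest)

df <- tibble(
    student_id = letters[1:4],
    student_id2 = letters[9:12],
    test1 = 10:13,
    test2 = 20:23,
    test3 = 30:33,
    test4 = 40:43
)

out <- df %>%
    # paste all columns of a row into one string (default separator "_")
    unite(new, everything(), remove = FALSE) %>%
    # hash each string; USE.NAMES = FALSE keeps the hash column unnamed
    mutate(hash = vapply(new, digest, character(1), USE.NAMES = FALSE)) %>%
    select(-new)

out
```

The result is a plain, ungrouped tibble, so no ungroup() is needed afterwards. One caveat of any unite()-based approach: values that themselves contain the separator can make two different rows produce the same string.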

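Yet another option is pmap, where we concatenate the elements of each row into a single vector with c(). This coerces the integers to character, because an atomic vector can hold only one type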

library(purrr)
df %>% 
     mutate(hash = pmap_chr(., ~ digest(c(...))))
# A tibble: 4 x 7
#  student_id student_id2 test1 test2 test3 test4 hash                            
#  <chr>      <chr>       <int> <int> <int> <int> <chr>                           
#1 a          i              10    20    30    40 f0fb4100907570ef9bda073b78dc44a6
#2 b          j              11    21    31    41 754b09e8d4d854aa5e40aa88d1edfc66
#3 c          k              12    22    32    42 5f3a699caff833e900fd956232cf61dd
#4 d          l              13    23    33    43 4d31c65284e5db36c37461126a9eb63c

The advantage here is that the column types in the result are left unchanged
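
If the coercion to character inside c(...) is itself unwanted, a small variation (not part of the original answer) is to hash a list of the row's values instead, since a list can hold elements of different types. This is a sketch that rebuilds the df from the question; the name out is only for illustration, and the hashes will differ from those shown above because a different object is serialized.

```r
library(dplyr)
library(purrr)
library(digest)

df <- tibble(
    student_id = letters[1:4],
    student_id2 = letters[9:12],
    test1 = 10:13,
    test2 = 20:23,
    test3 = 30:33,
    test4 = 40:43
)

# list(...) keeps each value's own type (character or integer),
# so digest() hashes the row without any coercion.
out <- mutate(df, hash = pmap_chr(df, ~ digest(list(...))))

out
```

With this variant a row containing the integer 10 and a row containing the string "10" are hashed as different objects, which may or may not be what is wanted.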

Tags: r, dplyr, rowwise, tibble