Adding a hash to each row using dplyr and digest in R

Question

I need to add a fingerprint to each row of a dataset so that I can check later versions of the same dataset for differences.

I know how to add a hash to each row in base R, like this:

data.frame(iris,hash=apply(iris,1,digest))

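I am learning dplyr because the datasets are getting larger and I need to store them in SQL Server. I tried something like the following, but the hashing does not work: every row gets the same hash value.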

iris %>%
  rowwise() %>%
  mutate(hash=digest(.))

Any pointers on row-wise hashing with dplyr? Thanks!

Answer 1

We can use do:

res <- iris %>%
         rowwise() %>% 
         do(data.frame(., hash = digest(.)))
head(res, 3)
# A tibble: 3 x 6
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species                             hash
#         <dbl>       <dbl>        <dbl>       <dbl>  <fctr>                            <chr>
#1          5.1         3.5          1.4         0.2  setosa e261621c90a9887a85d70aa460127c78
#2          4.9         3.0          1.4         0.2  setosa 7bf67322858048d82e19adb6399ef7a4
#3          4.7         3.2          1.3         0.2  setosa c20f3ee03573aed5929940a29e07a8bb
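
As a side note on why the attempt in the question returns one hash for every row: inside mutate(), the `.` is still the pipe placeholder, so it refers to the whole rowwise data frame rather than to the current row. The following is a minimal sketch of that behaviour (the variable name `same` is only illustrative):

```r
library(dplyr)
library(digest)

# In the pipe, `.` is the entire left-hand side (the rowwise data frame),
# so digest(.) hashes the whole table once per row.
same <- iris %>%
  rowwise() %>%
  mutate(hash = digest(.))

# Only one distinct hash value, as reported in the question
n_distinct(same$hash)
```

In do(), by contrast, `.` stands for the current group, which for a rowwise data frame is a single row.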

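Note that with apply, all columns are coerced to a single class, because apply converts its input to a matrix and a matrix can hold only one class. There will be a warning about converting the factor to character class.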
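
The coercion mentioned above can be checked directly in base R. This is a small sketch, assuming the built-in iris data; the names `m` and `row_classes` are illustrative only:

```r
# apply() calls as.matrix() on a data frame before iterating
m <- as.matrix(iris)
typeof(m)  # "character", because Species is a factor

# Every row therefore reaches the function as a character vector
row_classes <- apply(iris, 1, class)
unique(row_classes)
```

Because the apply version hashes character vectors while the do version hashes one-row data frames, the two approaches should not be expected to produce the same hash values for the same row.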

Answer 2

Since do has been superseded, this option may now be a better choice:

library(digest)
library(tidyverse)

# Create a tibble for practice
df <- tibble(x = rep(c(1,2), each=2), y = c(1,1,3,4), z = c(1,1,6,4))

# Note that row 1 and 2 are equal.
# This will generate a sha1 over specific columns (column z is excluded)
df %>% rowwise() %>% mutate(m = sha1( c(x, y ) ))

# This will generate over all columns,
# then convert the hash to integer
# (better for joining or other data operations later)

df %>% 
   rowwise() %>% 
   mutate(sha =
     digest2int( # generates a new integer hash
       sha1( c_across(everything() ) ) # across all columns
     )
   )

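It may be better still to convert everything to character and paste it together, so that the hash function is called only once. You can use unite for that: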

# unite() and digest2int() both work on whole columns, so rowwise() is not needed
df %>%
  unite(allCols, everything(), sep = "", remove = FALSE) %>%
  mutate(hash = digest2int(allCols)) %>%
  select(-allCols)
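
To tie this back to the goal in the question (spotting differences between versions of a dataset), here is a rough sketch of how row hashes might be compared. The helper `add_row_hash`, the `"|"` separator, the choice of md5, and the two small tibbles are all assumptions made for illustration, not part of the answers above:

```r
library(dplyr)
library(tidyr)
library(digest)

# Hypothetical helper: paste all columns into one key, then hash each key.
# A separator is used so that c("1", "12") and c("11", "2") give different keys.
add_row_hash <- function(data) {
  data %>%
    unite("key", everything(), sep = "|", remove = FALSE) %>%
    mutate(hash = vapply(key, digest, character(1),
                         algo = "md5", serialize = FALSE,
                         USE.NAMES = FALSE)) %>%
    select(-key)
}

# Two made-up versions of the same table; row 2 changes between them
old <- tibble(id = 1:3, value = c(10, 20, 30))
new <- tibble(id = 1:3, value = c(10, 25, 30))

# Rows of the new version whose hash does not appear in the old version
changed <- anti_join(add_row_hash(new), add_row_hash(old), by = "hash")
changed
```

Storing the hash as a character string avoids the extra collision risk of squeezing it into a 32-bit integer, at the cost of a wider join key.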
