What is faster/better: loop over each row of a data frame or split it into a list of length `nrow`, R

Question

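I am just wondering whether this is a serious trade-off that one should consider. Say you have a data frame in R and want to perform an operation on each observation (row). I know that iterating over rows is already a delicate matter, so I am wondering which of these three options is best: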

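1) A normal for loop over each row
2) Splitting the data frame into a list of nrow elements, applying the operation to each element, and binding the results back together
3) Doing the above in parallel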

Without any benchmarking, this is basically what I am asking, in pseudocode:


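library(dplyr)         # %>% and bind_rows()
library(future.apply)  # future_lapply(); also attaches future, which provides plan()

n = 1000000
x = 1:n
y = x + rnorm(n, mean=50, sd=50)

df = data.frame(
  x = x,
  y = y
)

# f is not defined in the question; this placeholder (an assumption) takes a
# one-row data frame and returns it unchanged
f = function(row) row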

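# 1)
# iterating over each row with normal for loop
for(r in seq_len(nrow(df))){
  row = df[r, ]    # extract row r as a one-row data frame
  row = f(row)     # apply the operation without touching the loop index r
  df[r, ] = row    # write the processed row back
}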

# 2)
# split df into a list of nrow(df) one-row data frames, apply f to each element
# and row-bind the results together
res = df %>% split(., .$x) %>% lapply(., function(x){
  f(x)
})

# the list names go into a separate "id" column so they do not clash with x
bind_rows(res, .id="id")

# 3)
# split df into a list of nrow(df) one-row data frames, apply f to each element in parallel
# and row-bind the results together
plan(multisession)  # without a parallel plan, future_lapply() runs sequentially

res = df %>% split(., .$x) %>% future_lapply(., function(x){
  f(x)
})
bind_rows(res, .id="id")

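Possibly none of the options above is the best one, so I would welcome any thoughts on this. Sorry if this is a very naive question. I have only just started using R.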

Answer 1

I often use the scheme `tibble %>% nest %>% mutate(map) %>% unnest`. Have a look at the example below.

library(tidyverse)
n = 10000

f = function(data) sqrt(data$x^2+data$y^2+data$z^2)
tibble(
  x = 1:n,
  y = x + rnorm(n, mean=50, sd=50),
  z = x + y + rnorm(n, mean=50, sd=50)
) %>% nest(data = c(x:z)) %>% 
  mutate(l = map(data, f)) %>% 
  unnest(c(data, l))

Output

# A tibble: 10,000 x 4
       x     y     z     l
   <int> <dbl> <dbl> <dbl>
 1     1  67.1 136.  151. 
 2     2  75.4 127.  148. 
 3     3 -11.1  38.9  40.6
 4     4  58.1 106.  121. 
 5     5  23.5 126.  128. 
 6     6  73.4 179.  193. 
 7     7  44.5 121.  129. 
 8     8 106.  131.  169. 
 9     9  32.5 140.  144. 
10    10 -27.7  82.7  87.8
# ... with 9,990 more rows

Personally, I find it very clear and elegant. But you are free to disagree.
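One detail worth keeping in mind about the pipeline above: `nest(data = c(x:z))` nests every column, so the nested tibble has a single row whose `data` cell holds the whole table, and `map()` calls `f` once on full columns rather than once per row. On that reading, a plain vectorized `mutate()` should give the same `l` column. The following is only a sketch to check this; the `set.seed()` call and the names `nested_step`, `nested` and `direct` are additions for illustration and are not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)  # assumption: fixed seed so the example is reproducible
n = 10000
f = function(data) sqrt(data$x^2 + data$y^2 + data$z^2)

df = tibble(
  x = 1:n,
  y = x + rnorm(n, mean = 50, sd = 50),
  z = x + y + rnorm(n, mean = 50, sd = 50)
)

# All columns are nested, so no grouping columns remain
nested_step = df %>% nest(data = c(x:z))
nrow(nested_step)  # number of times map() will call f

# The scheme from the answer
nested = nested_step %>%
  mutate(l = map(data, f)) %>%
  unnest(c(data, l))

# The same computation written as a vectorized mutate()
direct = df %>% mutate(l = sqrt(x^2 + y^2 + z^2))

all.equal(nested$l, direct$l)
```

If the operation can be expressed on whole columns like this, neither a row loop nor a per-row split is needed at all.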

Update 1

To be honest, your question also made me curious about the performance side. So I decided to check it. Here is the code:

library(tidyverse)
library(microbenchmark)

n = 1000
df = tibble(
  x = 1:n,
  y = x + rnorm(n, mean=50, sd=50),
  z = x + y + rnorm(n, mean=50, sd=50)
)

f = function(data) sqrt(data$x^2+data$y^2+data$z^2)

f1 = function(df){
  df %>% nest(data = c(x:z)) %>% 
    mutate(l = map(data, f)) %>% 
    unnest(c(data, l))
}
f1(df)

f2 = function(df){
  df = df %>% mutate(l=NA_real_)   # pre-allocate the result column as double
  for(r in seq_len(nrow(df))){     # seq_len() is safe for zero-row input
    row = df[r, ]
    df$l[r] = f(row)
  }
  df
}
f2(df)


f3 = function(df){
  res = df %>% 
    split(., .$x) %>% 
    lapply(., f)
  df %>% bind_cols(l = unlist(res))
}
f3(df)


ggplot2::autoplot(microbenchmark(f1(df), f2(df), f3(df), times=100))

Here is the result. Do I need to add anything else to explain why the scheme `tibble %>% nest %>% mutate(map) %>% unnest` is so cool?
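The benchmark above does not cover the parallel option from the question. As a rough sketch of how it could be tried, the code below splits the rows into a few large chunks rather than `nrow(df)` single-row pieces and hands the chunks to `future_lapply()`. The helper `f4`, its `n_chunks` argument and the choice of two workers are hypothetical and exist only for this illustration. Whether this beats the sequential versions depends on how expensive `f` is: for an operation as cheap as this one, the cost of starting workers and shipping data to them can easily outweigh the gain.

```r
library(future.apply)  # attaches future, which provides plan()

plan(multisession, workers = 2)  # assumption: two local background R sessions

f = function(data) sqrt(data$x^2 + data$y^2 + data$z^2)

n = 1000
df = data.frame(x = 1:n)
df$y = df$x + rnorm(n, mean = 50, sd = 50)
df$z = df$x + df$y + rnorm(n, mean = 50, sd = 50)

# Hypothetical helper: process the rows in n_chunks contiguous blocks in parallel
f4 = function(df, n_chunks = 4) {
  # assign each row to one of n_chunks consecutive blocks
  chunk_id = cut(seq_len(nrow(df)), breaks = n_chunks, labels = FALSE)
  pieces = split(df, chunk_id)
  # f is vectorized, so each worker handles a whole block in one call
  res = future_lapply(pieces, f)
  df$l = unlist(res, use.names = FALSE)
  df
}

out = f4(df)
all.equal(out$l, f(df))  # compare with the single vectorized call

plan(sequential)  # release the background workers
```

A function like `f4` could be added to the `microbenchmark()` call next to `f1`, `f2` and `f3` to compare the timings on your own machine.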

parallel-processing

future

r

dataframe