Assignment performance in data.frame's constructor

I can't make sense of how data.frame construction works.

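I had a look at this question, but I assumed that preallocating columns in a data.frame would be slow if you want to replace the data afterwards (duplicated work).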

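I then ran the benchmark below and found that passing the data as arguments to the data.frame constructor is much slower than constructing an empty data.frame first and then assigning the data.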

What is going on here?

library(microbenchmark)


# Prep -------------------#

n = 1000
s = seq(n)

f = runif(n)
g = as.factor(sample(1:100, size = n, replace = TRUE))
h = runif(n)
i = sample(LETTERS[1:26], size = n, replace = TRUE)


# Functions --------------#

## Construct data.frame and then assign
f_dollar = function(){
    d = data.frame(row.names  = s,
                   check.rows = FALSE, check.names = FALSE, stringsAsFactors = FALSE)
    d$first  = f
    d$second = g
    d$third  = h
    d$fourth = i
}


## Construct data.frame assigning named vectors
f_named   = function(){
    d = data.frame(first = f, second = g, third = h, fourth = i,
                   check.rows = FALSE, check.names = FALSE, stringsAsFactors = FALSE)
}

## Construct data.frame assigning unnamed vectors
f_unnamed = function(){
    d = data.frame(f, g, h, i,
                   check.rows = FALSE, check.names = FALSE, stringsAsFactors = FALSE)
}


# Profile ----------------#

microbenchmark(f_dollar(), f_named(), f_unnamed())

Results:

Unit: microseconds
        expr     min      lq     mean   median       uq      max neval
  f_dollar()  65.808  79.691  92.5668  87.3850 100.6715  191.446   100
   f_named() 205.962 221.761 245.2758 231.8325 251.2915  538.911   100
 f_unnamed() 269.416 283.689 339.8429 297.1045 332.8925 2800.185   100

Changing n to 100000 and running your microbenchmark() with 1000 trials to smooth out any variation produces the following results:

> microbenchmark(f_dollar(), f_named(), f_unnamed(), times=1000)
Unit: microseconds
        expr       min        lq       mean     median        uq      max neval
  f_dollar() 16559.490 17000.361 17444.4909 17282.3785 17587.723 24130.81  1000
   f_named()   211.338   233.266   277.4680   254.2595   302.779  2028.94  1000
 f_unnamed()   260.325   288.783   391.2701   313.7420   366.693 44304.51  1000

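This supports your initial impression that creating the data.frame object with the data is much more efficient than adding the data afterwards, which, as far as I know, recreates the data.frame at each variable append.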
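If you want to repeat the comparison yourself, here is a minimal sketch in base R only. It assumes the input vectors are regenerated after changing n, so that every column really has n values. The helper names build_incremental and build_at_once are made up for illustration and are not from the question. It first checks that both routes give the same columns, then prints rough timings.

```r
# Hypothetical helpers for illustration; not part of the original benchmark.
set.seed(1)
n <- 100000

# Regenerate the inputs at the new size so every column has n values.
f <- runif(n)
g <- as.factor(sample(1:100, size = n, replace = TRUE))
h <- runif(n)
i <- sample(LETTERS, size = n, replace = TRUE)

# Start from an empty n-row data.frame and append one column at a time.
build_incremental <- function() {
  d <- data.frame(row.names = seq_len(n))
  d$first  <- f
  d$second <- g
  d$third  <- h
  d$fourth <- i
  d
}

# Pass all columns to the constructor in a single call.
build_at_once <- function() {
  data.frame(first = f, second = g, third = h, fourth = i,
             stringsAsFactors = FALSE)
}

# Both routes should yield the same shape, names and column contents.
a <- build_incremental()
b <- build_at_once()
stopifnot(identical(dim(a), dim(b)), identical(names(a), names(b)))
for (nm in names(b)) stopifnot(identical(a[[nm]], b[[nm]]))

# Rough timings over 20 repetitions; numbers depend on machine and R version.
print(system.time(for (k in 1:20) build_incremental()))
print(system.time(for (k in 1:20) build_at_once()))
```

system.time is much coarser than microbenchmark, so treat the output only as a sanity check of which approach is faster at this size on your machine.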