Assignment performance in data.frame's constructor
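I'm having trouble understanding how data.frame construction works.
I've seen this question, but I assumed that preallocating columns in a data.frame would be slow if you intend to replace the data afterwards (duplicated work).
I then ran the benchmark below and found that passing the data as arguments to the data.frame constructor is much slower than constructing an empty data.frame and then assigning the data.
What is happening here?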
library(microbenchmark)

# Prep -------------------#
n = 1000
s = seq(n)
f = runif(n)
g = as.factor(sample(1:100, size = n, replace = TRUE))
h = runif(n)
i = sample(LETTERS[1:26], size = n, replace = TRUE)

# Functions --------------#
## Construct data.frame and then assign
f_dollar = function(){
  d = data.frame(row.names = s,
                 check.rows = FALSE, check.names = FALSE, stringsAsFactors = FALSE)
  d$first = f
  d$second = g
  d$third = h
  d$fourth = i
}

## Construct data.frame assigning named vectors
f_named = function(){
  d = data.frame(first = f, second = g, third = h, fourth = i,
                 check.rows = FALSE, check.names = FALSE, stringsAsFactors = FALSE)
}

## Construct data.frame assigning unnamed vectors
f_unnamed = function(){
  d = data.frame(f, g, h, i,
                 check.rows = FALSE, check.names = FALSE, stringsAsFactors = FALSE)
}

# Profile ----------------#
microbenchmark(f_dollar(), f_named(), f_unnamed())
Results:
Unit: microseconds
        expr     min      lq     mean   median       uq      max neval
  f_dollar()  65.808  79.691  92.5668  87.3850 100.6715  191.446   100
   f_named() 205.962 221.761 245.2758 231.8325 251.2915  538.911   100
 f_unnamed() 269.416 283.689 339.8429 297.1045 332.8925 2800.185   100
Changing n to 100000 and running your microbenchmark() with 1000 trials, to smooth out any variation, yields the following results:
> microbenchmark(f_dollar(), f_named(), f_unnamed(), times=1000)
Unit: microseconds
        expr       min        lq       mean     median        uq      max neval
  f_dollar() 16559.490 17000.361 17444.4909 17282.3785 17587.723 24130.81  1000
   f_named()   211.338   233.266   277.4680   254.2595   302.779  2028.94  1000
 f_unnamed()   260.325   288.783   391.2701   313.7420   366.693 44304.51  1000
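This supports your initial impression that creating the data.frame object with the data is much more efficient than adding the data afterwards; as far as I know, the data.frame is recreated each time a variable is appended.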
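If you want to see where the crossover happens on your own machine, here is a minimal sketch that repeats the comparison at several sizes using only base R's system.time(). The helper name compare_construction and the choice of sizes and repetitions are made up for illustration; it also checks that both styles end up with the same columns, so the timings compare like with like. Timings will differ between machines and R versions, so no output is shown.

```r
# Sketch only: compare_construction() is a hypothetical helper, not a package function.
compare_construction <- function(n, reps = 20) {
  s <- seq_len(n)
  f <- runif(n)
  g <- as.factor(sample(1:100, size = n, replace = TRUE))
  h <- runif(n)
  i <- sample(LETTERS, size = n, replace = TRUE)

  # Style 1: empty data.frame with row names, then assign columns one by one.
  build_dollar <- function() {
    d <- data.frame(row.names = s, check.rows = FALSE,
                    check.names = FALSE, stringsAsFactors = FALSE)
    d$first <- f
    d$second <- g
    d$third <- h
    d$fourth <- i
    d
  }

  # Style 2: pass all columns to the constructor at once.
  build_named <- function() {
    data.frame(first = f, second = g, third = h, fourth = i,
               check.rows = FALSE, check.names = FALSE,
               stringsAsFactors = FALSE)
  }

  # Both styles should produce the same columns (row-name storage may differ).
  stopifnot(identical(as.list(build_dollar()), as.list(build_named())))

  # Total elapsed seconds for `reps` calls of a zero-argument function.
  elapsed <- function(fun) {
    system.time(for (k in seq_len(reps)) fun())[["elapsed"]]
  }

  data.frame(n = n,
             dollar_sec = elapsed(build_dollar),
             named_sec = elapsed(build_named))
}

# One row of timings per size.
timings <- do.call(rbind, lapply(c(1e3, 1e4, 1e5), compare_construction))
print(timings)
```

system.time() is much coarser than microbenchmark(), so treat the small-n rows as rough; the point is only to see how the two columns grow with n.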