Translate R for loop into apply function
I wrote a for loop in my code:
for (i in 2:nrow(ProductionWellYear2)) {
  if (ProductionWellYear2[i, ncol(ProductionWellYear2)] == 0) {
    ProductionWellYear2[i, ncol(ProductionWellYear2)] = ProductionWellYear2[i - 1, ncol(ProductionWellYear2)] + 1
  } else {
    ProductionWellYear2[i, ncol(ProductionWellYear2)] = ProductionWellYear2[i, ncol(ProductionWellYear2)]
  }
}
However, this is very time-consuming because the data frame has more than 800,000 rows. How can I make it faster and avoid the for loop?
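This should work for you, but without seeing your data I can't verify that the results are what you want. That said, the process here is not much different from what you originally wrote. Benchmarking on my example data does suggest it is faster, though not necessarily "fast".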
library(microbenchmark)

# Create fake data
set.seed(1)
ProductionWellYear <- data.frame(A = as.integer(rnorm(2500)),
                                 B = as.integer(rnorm(2500)),
                                 C = as.integer(rnorm(2500)))

# Copy it to confirm results of both processes are the same
ProductionWellYear2 <- ProductionWellYear

# Slightly modified original version
# (the assignments inside this function change a local copy only;
#  the global ProductionWellYear is left untouched)
method1 <- function() {
  cols <- ncol(ProductionWellYear)
  for (i in 2:nrow(ProductionWellYear)) {
    if (ProductionWellYear[i, cols] == 0) {
      ProductionWellYear[i, cols] = ProductionWellYear[i - 1, cols] + 1
    } else {
      ProductionWellYear[i, cols] = ProductionWellYear[i, cols]
    }
  }
}

# New version
# (`<<-` writes to the global ProductionWellYear2)
method2 <- function() {
  cols <- ncol(ProductionWellYear2)
  sapply(2:nrow(ProductionWellYear2), function(i) {
    if (ProductionWellYear2[i, cols] == 0) {
      ProductionWellYear2[i, cols] <<- ProductionWellYear2[i - 1, cols] + 1
    }
  })
}

# Comparing the outputs
# (as written, this runs before either function has been called)
all(ProductionWellYear == ProductionWellYear2)
#[1] TRUE

result <- microbenchmark(method1(), method2())
result
#Unit: milliseconds
#      expr       min       lq      mean   median        uq      max neval
# method1() 151.78802 167.3932 190.14905 176.2855 197.60406 337.9904   100
# method2()  45.56065  53.7744  67.55549  59.9299  72.81873 174.1417   100
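Much of the cost of a loop like this likely comes from indexing the data frame on every iteration rather than from the loop itself. As a minimal sketch, assuming the last column is numeric and contains no NA values, the column can be pulled out as a plain vector, updated in a loop, and written back once. The function name fill_by_loop and the toy data are hypothetical and not part of the original post.

```r
# Hypothetical helper: same logic as the original loop, but it works on a
# plain vector instead of indexing the data frame on every iteration.
fill_by_loop <- function(df) {
  last_col <- ncol(df)
  y <- df[[last_col]]              # extract the column once
  for (i in seq_along(y)[-1]) {    # rows 2..n; empty when there is one row
    if (y[i] == 0) {
      y[i] <- y[i - 1] + 1         # uses the already updated previous value
    }
  }
  df[[last_col]] <- y              # write the column back once
  df
}

# Toy data (assumed for illustration only)
toy <- data.frame(x = 1:5, year = c(2014, 0, 0, 2017, 0))
filled <- fill_by_loop(toy)
filled$year
# 2014 2015 2016 2017 2018
```

Because the function takes the data frame as an argument and returns the modified copy, the result can be compared directly against other approaches without relying on global assignment.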
You can use conditional assignment, which takes advantage of R being a vectorized language.
Consider this initial data frame:
          X1          X2         X3 year
1  1.3709584 -0.09465904 -0.1333213 2014
2 -0.5646982  2.01842371  0.6359504    0
3  0.3631284 -0.06271410 -0.2842529 2016
4  0.6328626  1.30486965 -2.6564554    0
5  0.4042683  2.28664539 -2.4404669 2018
6 -0.1061245 -1.38886070  1.3201133    0
7  1.5115220 -0.27878877 -0.3066386 2020
Then do:
num.col <- ncol(ProductionWellYear2) # to keep code short
# Matches the loop only if the first row is not 0 and no two consecutive rows are 0
ProductionWellYear2[ProductionWellYear2[num.col] == 0, num.col] <-
  ProductionWellYear2[which(ProductionWellYear2[num.col] == 0) - 1, num.col] + 1
Resulting data frame:
          X1          X2         X3 year
1  1.3709584 -0.09465904 -0.1333213 2014
2 -0.5646982  2.01842371  0.6359504 2015
3  0.3631284 -0.06271410 -0.2842529 2016
4  0.6328626  1.30486965 -2.6564554 2017
5  0.4042683  2.28664539 -2.4404669 2018
6 -0.1061245 -1.38886070  1.3201133 2019
7  1.5115220 -0.27878877 -0.3066386 2020
Data:
ProductionWellYear2 <- structure(list(X1 = c(1.37095844714667, -0.564698171396089, 0.363128411337339,
0.63286260496104, 0.404268323140999, -0.106124516091484, 1.51152199743894
), X2 = c(-0.0946590384130976, 2.01842371387704, -0.062714099052421,
1.30486965422349, 2.28664539270111, -1.38886070111234, -0.278788766817371
), X3 = c(-0.133321336393658, 0.635950398070074, -0.284252921416072,
-2.65645542090478, -2.44046692857552, 1.32011334573019, -0.306638594078475
), year = c(2014, 0, 2016, 0, 2018, 0, 2020)), row.names = c(NA,
-7L), class = "data.frame")
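The conditional assignment above looks only one row back, so it would not reproduce the loop when several zeros appear in a row. A possible vectorized sketch for that case, assuming the column has at least one element and no NA values: find the index of the most recent non-zero row with cummax(), then add the distance from that row. The function name fill_zero_years is hypothetical, and the first row is treated as a starting point even if it is 0, mirroring the loop, which starts at row 2.

```r
# Hypothetical helper: vectorized equivalent of the original loop for a
# single column, including runs of consecutive zeros.
fill_zero_years <- function(y) {
  stopifnot(length(y) >= 1, !anyNA(y))
  idx <- seq_along(y)
  anchor <- y != 0
  anchor[1] <- TRUE                  # the loop never modifies row 1
  # index of the most recent anchor row at or before each position
  last_anchor <- cummax(ifelse(anchor, idx, 0L))
  # anchor value plus the number of rows since that anchor
  y[last_anchor] + (idx - last_anchor)
}

# Toy vector with runs of zeros (assumed for illustration only)
year <- c(2014, 0, 0, 2016, 0, 2018, 0, 0, 0)
fill_zero_years(year)
# 2014 2015 2016 2016 2017 2018 2019 2020 2021
```

For the data frame shown above, this could be applied as ProductionWellYear2$year <- fill_zero_years(ProductionWellYear2$year), which gives 2014 through 2020.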