Rewriting a for loop in R

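I have a loop in my code that I would like to rewrite so that running the code takes less computation time. I know you are generally supposed to avoid loops in R, but I can't think of another way to achieve my goal.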

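I have a dataset "df_1531" containing a lot of data, which I need to split into several parts using subset() (if anyone knows a better way, please let me know ;)). I have a vector with 21 variable names to which I would like to assign the subsets of df_1531. In addition, the script contains 22 variables holding the boundaries (shift_XY_time).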

Here is my code:

# Vector with the 22 time boundaries of the shift slots
shift_time_list<- c(startdate, shift_1m_time, shift_1a_time, shift_1n_time,
                               shift_2m_time, shift_2a_time, shift_2n_time,
                               shift_3m_time, shift_3a_time, shift_3n_time,
                               shift_4m_time, shift_4a_time, shift_4n_time, 
                               shift_5m_time, shift_5a_time, shift_5n_time,
                               shift_6m_time, shift_6a_time, shift_6n_time,
                               shift_7m_time, shift_7a_time, shift_7n_time)
# Vector with the 21 subset names
shift_sub_list <- c("shift_1m_sub", "shift_1a_sub", "shift_1n_sub",
                    "shift_2m_sub", "shift_2a_sub", "shift_2n_sub",
                    "shift_3m_sub", "shift_3a_sub", "shift_3n_sub",
                    "shift_4m_sub", "shift_4a_sub", "shift_4n_sub", 
                    "shift_5m_sub", "shift_5a_sub", "shift_5n_sub",
                    "shift_6m_sub", "shift_6a_sub", "shift_6n_sub",
                    "shift_7m_sub", "shift_7a_sub", "shift_7n_sub")

# The actual loop that I'd like to rewrite
for (i in 1:21) {
  assign(shift_sub_list[i], subset(df_1531, df_1531$'PLS FFM' >= shift_time_list[i] & df_1531$'PLS FFM' < shift_time_list[i+1]))
}

Running the loop takes about 6 or 7 seconds. So if anyone knows a better, cleaner, or faster way to write this code, I would love to hear your suggestions.

**Reproducible example**

mydata <- cars

dput(cars)
structure(list(speed = c(4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 
                         12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 
                         16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 
                         22, 23, 24, 24, 24, 24, 25), dist = c(2, 10, 4, 22, 16, 10, 18, 
                                                               26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 
                                                               20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 
                                                               48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                  -50L))

dist_interval_list <- c(  0,   5,  10,  15,
                         20,  25,  30,  35, 
                         40,  45,  50,  55, 
                         60,  65,  70,  75,
                         80,  85,  90,  95,
                        100, 105, 110, 115, 120)


var_name_list <- c("var_name_1a", "var_name_1b", "var_name_1c", "var_name_1d",
                    "var_name_2a", "var_name_2b", "var_name_2c", "var_name_2d",
                    "var_name_3a", "var_name_3b", "var_name_3c", "var_name_3d",
                    "var_name_4a", "var_name_4b", "var_name_4c", "var_name_4d",
                    "var_name_5a", "var_name_5b", "var_name_5c", "var_name_5d",
                    "var_name_6a", "var_name_6b", "var_name_6c", "var_name_6d")


for (i in 1:24){
  assign(var_name_list[i], subset(mydata,
                                       mydata$dist >= dist_interval_list[i] & 
                                       mydata$dist < dist_interval_list[i+1]))
}

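Going by the 'reproducible' part, and assuming the ultimate aim is to summarise information from another column, you can take advantage of the fact that the intervals do not overlap and simply use the cut function.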

library(tidyverse)

mydata %>% 
  # right = FALSE gives [a, b) intervals, matching the >= / < test in the loop
  mutate(interval = cut(dist, breaks = dist_interval_list, right = FALSE)) %>% 
  group_by(interval) %>% 
  summarise(sum = sum(speed))

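This should be much faster, and it also keeps you from getting lost in a cluttered environment full of variables that are really part of the data. You want to keep all your data in one data frame for as long as possible ;) If your function does not work on data frames, you may want to use something like purrrlyr::invoke_rows in the final modelling step.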
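
If separate subsets are still needed later on, one option is to combine cut() with base R's split(), which returns a single named list instead of 24 variables created through assign(). This is only a sketch and not part of the answer above: it assumes the `cars` example data, rebuilds the break points with seq() (equivalent to dist_interval_list in the question), and the name `subsets` is made up for illustration.

```r
# Reproducible data and break points from the question
mydata <- cars
dist_interval_list <- seq(0, 120, by = 5)

# Label each row with its [lower, upper) interval; right = FALSE mirrors
# the `>=` / `<` comparison used in the original loop
interval <- cut(mydata$dist, breaks = dist_interval_list, right = FALSE)

# split() returns a named list with one data frame per interval.
# Rows that fall outside every interval (here dist == 120) get NA
# from cut() and are left out, just as in the original loop.
subsets <- split(mydata, interval)

# Optional: reuse the names from the question instead of the interval labels
# names(subsets) <- var_name_list

# Access a single subset by its interval label
subsets[["[10,15)"]]
```

Because everything lives in one list, later steps can go through lapply(subsets, ...) rather than looking up variables by name with get().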