Operating on subgroups inside a data frame is very slow

Question

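I have a data frame containing the number of pushups done by different people (identified by the id column) on four days of a week. I need to do the following.

For each id:
1. I want to add a column for each day showing the cumulative number of pushups done up to and including that day.
2. I want to add a column for each day showing the number of pushups done the next day. (Note: on the last day we do not know how many pushups were done the next day, so we only consider rows up to n-1.)

I wrote this by first arranging the rows by (id, dayofweek), then creating a temporary data frame on which I performed all of these operations iteratively. The problem is that on a huge data frame this is very, very slow. Is there a more elegant way to do these two things? Please see my code and the input and output data frames below.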

Input (after arranging)

> df
   id dayofweek pushupcount cumulativepushups nextdaypushupcount
1   1      day1         100                 0                  0
2   1      day2         240                 0                  0
3   1      day3         200                 0                  0
4   1      day4         170                 0                  0
5   2      day1         220                 0                  0
6   2      day2         190                 0                  0
7   2      day3         300                 0                  0
8   2      day4         150                 0                  0
9   3      day1         260                 0                  0
10  3      day2         160                 0                  0
11  3      day3         200                 0                  0
12  3      day4         210                 0                  0

Output

> df
   id dayofweek pushupcount cumulativepushups nextdaypushupcount
1   1      day1         100               100                240
2   1      day2         240               340                200
3   1      day3         200               540                170
5   2      day1         220               220                190
6   2      day2         190               410                300
7   2      day3         300               710                150
9   3      day1         260               260                160
10  3      day2         160               420                200
11  3      day3         200               620                210

Creating the data

#creating data
id = c(1,2,3,2,1,2,3,1,3,2,1,3)
dayofweek = c('day1','day2','day3','day1','day2','day3','day4','day4','day1','day4','day3','day2')
pushupcount = c(100,190,200,220,240,300,210,170,260,150,200,160)
df =  data.frame(id,dayofweek,pushupcount,stringsAsFactors = FALSE)

Code

#arranging data in increasing order of day of week for each id
library('plyr')
df = arrange(df,id,dayofweek)

#adding the new columns
df$cumulativepushups = 0;
df$nextdaypushupcount = 0;

finaldf = NULL;

#the 'cumulativepushups' column is basically a running sum for each id
#the 'nextdaypushupcount' column is number of pushups for that id for the next day
#(NOTE that since on the last day, we do not know how many pushups were done the next day, we consider only till rows n-1)
uniqueid = unique(df$id)
for(i in 1:length(uniqueid))
{
  tempdf = df[which(df$id == uniqueid[i]),]

  for(j in 1:(nrow(tempdf)-1))
  {
    if(j == 1)
    {
      tempdf[j,]$cumulativepushups = tempdf[j,]$pushupcount
    }
    else
    {
      tempdf[j,]$cumulativepushups = tempdf[j-1,]$cumulativepushups + tempdf[j,]$pushupcount
    }

    tempdf[j,]$nextdaypushupcount = tempdf[j+1,]$pushupcount

    finaldf = rbind(finaldf,tempdf[j,])
  }
}
df = finaldf

Thanks.

Answer 1

You can try dplyr. Sort the dataset by "id" and "dayofweek" (arrange(..)). After grouping by "id", use lead to create "nextdaypushupcount". Remove the last observation of each group (slice(..)). Then take the cumsum of "pushupcount" to create "cumulativepushups".

library(dplyr)
df1 <- arrange(df, id, dayofweek) %>%
           group_by(id) %>%
           mutate(nextdaypushupcount = lead(pushupcount)) %>%
           slice(-n()) %>%
           mutate(cumulativepushups = cumsum(pushupcount))
df1
 #    id dayofweek pushupcount nextdaypushupcount cumulativepushups
 #1  1      day1           100                240               100
 #2  1      day2           240                200               340
 #3  1      day3           200                170               540
 #4  2      day1           220                190               220
 #5  2      day2           190                300               410
 #6  2      day3           300                150               710
 #7  3      day1           260                160               260
 #8  3      day2           160                200               420
 #9  3      day3           200                210               620

Data

id <- c(1,2,3,2,1,2,3,1,3,2,1,3)
dayofweek <- c('day1','day2','day3','day1','day2','day3','day4','day4',
 'day1','day4','day3','day2')
pushupcount <- c(100,190,200,220,240,300,210,170,260,150,200,160)
df <-  data.frame(id,dayofweek,pushupcount,stringsAsFactors = FALSE)
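
For comparison, the same grouped steps (sort, running sum per id, next day's value per id, drop each id's last row) can probably be sketched in base R with ave(), without loading any package. This is only an illustrative sketch and not part of the answer above: the names df_sorted and base_result are made up here, and dropping rows with is.na() assumes that pushupcount itself contains no missing values.

```r
# Same input data as in the question
id <- c(1, 2, 3, 2, 1, 2, 3, 1, 3, 2, 1, 3)
dayofweek <- c("day1", "day2", "day3", "day1", "day2", "day3", "day4", "day4",
               "day1", "day4", "day3", "day2")
pushupcount <- c(100, 190, 200, 220, 240, 300, 210, 170, 260, 150, 200, 160)
df <- data.frame(id, dayofweek, pushupcount, stringsAsFactors = FALSE)

# Sort by id, then by day within each id
df_sorted <- df[order(df$id, df$dayofweek), ]

# Running sum of pushupcount within each id
df_sorted$cumulativepushups <- ave(df_sorted$pushupcount, df_sorted$id,
                                   FUN = cumsum)

# Next day's count within each id: shift values up by one and pad with NA
df_sorted$nextdaypushupcount <- ave(df_sorted$pushupcount, df_sorted$id,
                                    FUN = function(x) c(x[-1], NA))

# Drop each id's last day, where the next day's count is unknown
# (assumption: pushupcount has no NA values of its own)
base_result <- df_sorted[!is.na(df_sorted$nextdaypushupcount), ]
rownames(base_result) <- NULL
base_result
```

Like the dplyr version, this works on whole vectors per group instead of growing a data frame row by row with rbind, which is the main cost in the loop from the question.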
