Operating on subgroups inside a data frame is very slow
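I have a data frame containing the number of pushups done by different people (identified by the id column) on four days of a week. I need to do the following:
- For each id, find the running sum of pushups (the cumulative sum).
- For each day, add a column showing the number of pushups done on the next day. (Note: on the last day we do not know how many pushups were done the next day, so we only consider rows up to n-1.)
I wrote this by first arranging the rows by (id, dayofweek), then creating a temporary data frame on which I performed all of these operations iteratively. The problem is that on a huge data frame this is very, very slow. Is there a more elegant way to do these two things? Please see my code and the input and output data frames below.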
Input (after sorting)
> df
   id dayofweek pushupcount cumulativepushups nextdaypushupcount
1   1      day1         100                 0                  0
2   1      day2         240                 0                  0
3   1      day3         200                 0                  0
4   1      day4         170                 0                  0
5   2      day1         220                 0                  0
6   2      day2         190                 0                  0
7   2      day3         300                 0                  0
8   2      day4         150                 0                  0
9   3      day1         260                 0                  0
10  3      day2         160                 0                  0
11  3      day3         200                 0                  0
12  3      day4         210                 0                  0
Output
> df
   id dayofweek pushupcount cumulativepushups nextdaypushupcount
1   1      day1         100               100                240
2   1      day2         240               340                200
3   1      day3         200               540                170
5   2      day1         220               220                190
6   2      day2         190               410                300
7   2      day3         300               710                150
9   3      day1         260               260                160
10  3      day2         160               420                200
11  3      day3         200               620                210
Creating the data
#creating data
id = c(1,2,3,2,1,2,3,1,3,2,1,3)
dayofweek = c('day1','day2','day3','day1','day2','day3','day4','day4','day1','day4','day3','day2')
pushupcount = c(100,190,200,220,240,300,210,170,260,150,200,160)
df = data.frame(id,dayofweek,pushupcount,stringsAsFactors = FALSE)
Code
#arranging data in increasing order of day of week for each id
library('plyr')
df = arrange(df,id,dayofweek)
#adding the new columns
df$cumulativepushups = 0;
df$nextdaypushupcount = 0;
finaldf = NULL;
#the 'cumulativepushups' column is basically a running sum for each id
#the 'nextdaypushupcount' column is number of pushups for that id for the next day
#(NOTE that since on the last day, we do not know how many pushups were done the next day, we consider only till rows n-1)
uniqueid = unique(df$id)
for(i in 1:length(uniqueid))
{
  tempdf = df[which(df$id == uniqueid[i]),]
  for(j in 1:(nrow(tempdf)-1))
  {
    if(j == 1)
    {
      tempdf[j,]$cumulativepushups = tempdf[j,]$pushupcount
    }
    else
    {
      tempdf[j,]$cumulativepushups = tempdf[j-1,]$cumulativepushups + tempdf[j,]$pushupcount
    }
    tempdf[j,]$nextdaypushupcount = tempdf[j+1,]$pushupcount
    finaldf = rbind(finaldf,tempdf[j,])
  }
}
df = finaldf
Thanks.
You could try dplyr. Sort the dataset by "id" and "dayofweek" (arrange(..)). After grouping by "id", use lead to create "nextdaypushupcount". Remove the last observation of each group (slice(..)). Then take the cumsum of "pushupcount" to create "cumulativepushups".
library(dplyr)
df1 <- arrange(df, id, dayofweek)%>%
group_by(id) %>%
mutate(nextdaypushupcount=lead(pushupcount)) %>%
slice(-n())%>%
mutate(cumulativepushups=cumsum(pushupcount))
df1
# id dayofweek pushupcount nextdaypushupcount cumulativepushups
#1 1 day1 100 240 100
#2 1 day2 240 200 340
#3 1 day3 200 170 540
#4 2 day1 220 190 220
#5 2 day2 190 300 410
#6 2 day3 300 150 710
#7 3 day1 260 160 260
#8 3 day2 160 200 420
#9 3 day3 200 210 620
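If adding a package is not an option, the same steps can be sketched in base R with order() and ave(). This is only an illustrative alternative, not part of the answer above; the name res is made up for the example, and the final filter assumes pushupcount itself contains no missing values.

```r
# Rebuild the example data from the question.
df <- data.frame(
  id = c(1, 2, 3, 2, 1, 2, 3, 1, 3, 2, 1, 3),
  dayofweek = c("day1", "day2", "day3", "day1", "day2", "day3",
                "day4", "day4", "day1", "day4", "day3", "day2"),
  pushupcount = c(100, 190, 200, 220, 240, 300, 210, 170, 260, 150, 200, 160),
  stringsAsFactors = FALSE
)

# Sort by id, then by day within each id.
df <- df[order(df$id, df$dayofweek), ]

# Running sum of pushups within each id.
df$cumulativepushups <- ave(df$pushupcount, df$id, FUN = cumsum)

# Next day's count within each id; the last day of each id gets NA.
df$nextdaypushupcount <- ave(df$pushupcount, df$id,
                             FUN = function(x) c(x[-1], NA))

# Drop the last day of each id, where the next-day count is unknown.
# Assumption: pushupcount has no NA values of its own.
res <- df[!is.na(df$nextdaypushupcount), ]
rownames(res) <- NULL
res
```

Both ave() calls work on whole vectors per group, so nothing is grown row by row with rbind() as in the loop from the question.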
Data
id <- c(1,2,3,2,1,2,3,1,3,2,1,3)
dayofweek <- c('day1','day2','day3','day1','day2','day3','day4','day4',
'day1','day4','day3','day2')
pushupcount <- c(100,190,200,220,240,300,210,170,260,150,200,160)
df <- data.frame(id,dayofweek,pushupcount,stringsAsFactors = FALSE)
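For very large data frames, a data.table version of the same idea may also be worth trying. The sketch below is an assumption-laden illustration rather than a benchmarked recommendation: it assumes the data.table package is installed, and the names dt and res are made up for the example.

```r
library(data.table)

# Rebuild the example data from the question.
df <- data.frame(
  id = c(1, 2, 3, 2, 1, 2, 3, 1, 3, 2, 1, 3),
  dayofweek = c("day1", "day2", "day3", "day1", "day2", "day3",
                "day4", "day4", "day1", "day4", "day3", "day2"),
  pushupcount = c(100, 190, 200, 220, 240, 300, 210, 170, 260, 150, 200, 160),
  stringsAsFactors = FALSE
)

# Work on a data.table copy and sort it by id, then by day.
dt <- as.data.table(df)
setorder(dt, id, dayofweek)

# Running sum and next-day count, both computed within each id.
dt[, cumulativepushups := cumsum(pushupcount), by = id]
dt[, nextdaypushupcount := shift(pushupcount, type = "lead"), by = id]

# Drop the last row of each id, where the next-day count is unknown.
res <- dt[, head(.SD, -1L), by = id]
res
```

Here shift(..., type = "lead") plays the role of dplyr's lead(), and head(.SD, -1L) plays the role of slice(-n()).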