Cumulative sum of certain numbers in a dataframe ordered by date

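I have a dataframe with dates and values, and I want one cumsum over only the positive values and one cumsum over only the negative values. The same date sometimes appears several times, and then several days are missing (no value = no row).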

First I just tested a plain cumulative sum. These do accumulate, but not in date order:

df$cumsum <- cumsum(df$values) 
# or
df$cumsum  <- ave(df$values, FUN=cumsum)
# Should accumulate by date, but the result does not line up with the rows
df$cumsum   <- cumsum(df[order(df$date, df$values), "values"])

I finally found a solution that does the first step the way I want (it is not what I wanted to do within the data frame, but it gets the job done):

library(data.table)
dt <- data.table(df)
dt[order(date), cumsum := cumsum(values)]

That works, but every attempt to filter for values greater than 0 failed. In the end I subset the data and got the result, but this is not what I want.

dt.pos <- data.table(subset(df, values > 0))
dt.pos[order(date), cumsum := cumsum(values)]

dt.neg <- data.table(subset(df, values < 0))
dt.neg[order(date), cumsum := cumsum(values)]

I am looking for something as simple as the Python equivalent (with an ordered dataframe):

df["cumsum_pos"] = df["values"][df["values"] > 0].cumsum()
df["cumsum_neg"] = df["values"][df["values"] < 0].cumsum()

/Edit

df <- data.frame(date = as.Date(c("2016-12-08", "2016-12-07", "2016-12-05", "2017-01-05", 
                                  "2017-01-10", "2017-01-11", "2017-01-11")),
                 values = c(10, -10, 5, 5, -7, 8, 8))

# just the cumsum
# expected output = c(5, -5, 5, 10, 3, 11, 19)

df$cumsum <- cumsum(df$values)
# output = c(10, 0, 5, 10, 3, 11, 19)

df$cumsum  <- ave(df$values, FUN=cumsum)
# output = c(10, 0, 5, 10, 3, 11, 19)

df$cumsum <- cumsum(df[order(df$date, df$values), "values"])
# output = c(5, -5, 5, 10, 3, 11, 19) correct in this example
# doesn't work with dates in a different order 2016-12-31, 2016-12-30, ... 2015-12-31, 2015-12-30

# Now for just the positives
# expected output = c(10, 0, 5, 15, 15, 23, 31)
df$cumsum.pos[df$values > 0] <- cumsum(df[order(df$date, df$values), "values"][df$values > 0])
# output = c(5, NA, 15, 20, NA, 28, 36)

# And then the same with just the negatives

/Edit

The suggestion in nicolas's comment did not produce the correct output:

df<-df[order(df$date),]
# values = c(5, -10, 10, 5, -7, 8, 8)
# expected output = c(5, 5, 15, 20, 20, 28, 36)
df$cumsum<-ave(df$values,df$values>0,FUN=cumsum)
# output = c(5, -10, 15, 20, -17, 28, 36)

You can use this:

library(data.table)
df <- as.data.table(df)

# Order by date
df <- df[order(date)]

# Perform the cumsum for positives and negatives separately
df[, expected := cumsum(values), by = sign(values)]

# Just for the negatives, get the previous positive value
df[, expected := ifelse(values > 0, expected, c(0, expected[-.N]))]

print(df)

         date values expected
1: 2016-12-05      5        5
2: 2016-12-07    -10        5
3: 2016-12-08     10       15
4: 2017-01-05      5       20
5: 2017-01-10     -7       20
6: 2017-01-11      8       28
7: 2017-01-11      8       36
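
As a possible alternative, the same `expected` column can be sketched in base R without grouping: zero out the unwanted sign with `pmax()` / `pmin()` before calling `cumsum()`. This assumes the running total should simply carry forward across rows of the other sign, which is what the expected output in the question suggests. The column names `cumsum_pos` and `cumsum_neg` are borrowed from the Python snippet in the question and are only illustrative.

```r
# Example data from the question
df <- data.frame(
  date = as.Date(c("2016-12-08", "2016-12-07", "2016-12-05", "2017-01-05",
                   "2017-01-10", "2017-01-11", "2017-01-11")),
  values = c(10, -10, 5, 5, -7, 8, 8)
)

# Sort by date; rows with equal dates keep their original relative order
df <- df[order(df$date), ]

# Negatives count as 0 in the positive total, positives as 0 in the negative total
df$cumsum_pos <- cumsum(pmax(df$values, 0))
df$cumsum_neg <- cumsum(pmin(df$values, 0))

print(df)
```

Because a row of the other sign adds 0 to the running total, runs of consecutive negatives (or positives) need no special handling in this sketch.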

Note that if several negative values occur in a row, the operation needs to be repeated. For example, if your data frame is this:

df <- data.frame(date = as.Date(c("2016-12-08", "2016-12-07", "2016-12-05", "2017-01-05","2017-01-10", "2017-01-10", "2017-01-11", "2017-01-11")), 
values = c(10, -10, 5, 5, -7, -15, 8, 8))

A single run of the code above produces the following output:

         date values expected
1: 2016-12-05      5        5
2: 2016-12-07    -10        5
3: 2016-12-08     10       15
4: 2017-01-05      5       20
5: 2017-01-10     -7       20
6: 2017-01-10    -15      -17
7: 2017-01-11      8       28
8: 2017-01-11      8       36

The value -17 is wrong. To avoid this problem, you can repeat the process until no negative values remain. So the complete code is:

library(data.table)
df <- as.data.table(df)  # the example above is a plain data.frame

df <- df[order(date)]
df[, expected := cumsum(values), by = sign(values)]

# While any negative totals remain, pull the previous row's total forward
while (any(df$expected < 0)) {
  df[, expected := ifelse(values > 0, expected, c(0, expected[-.N]))]
}

print(df)
         date values expected
1: 2016-12-05      5        5
2: 2016-12-07    -10        5
3: 2016-12-08     10       15
4: 2017-01-05      5       20
5: 2017-01-10     -7       20
6: 2017-01-10    -15       20
7: 2017-01-11      8       28
8: 2017-01-11      8       36
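
The repeated shifting above amounts to carrying the last positive total forward over each run of negatives. A loop-free sketch of that idea, assuming a data.table version recent enough to provide `nafill()`, blanks the non-positive rows and fills them with `type = "locf"` (last observation carried forward). The object name `dt` is just illustrative.

```r
library(data.table)

# Example with two consecutive negatives, as in the answer above
dt <- data.table(
  date = as.Date(c("2016-12-08", "2016-12-07", "2016-12-05", "2017-01-05",
                   "2017-01-10", "2017-01-10", "2017-01-11", "2017-01-11")),
  values = c(10, -10, 5, 5, -7, -15, 8, 8)
)

# Sort by date in place
setorder(dt, date)

# Running totals per sign, as in the answer
dt[, expected := cumsum(values), by = sign(values)]

# Blank out the non-positive rows, then carry the last positive total forward
dt[values <= 0, expected := NA_real_]
dt[, expected := nafill(expected, type = "locf")]

# Rows before the first positive value have nothing to carry forward: use 0
dt[is.na(expected), expected := 0]

print(dt)
```

The fill runs once regardless of how many negatives appear in a row, so no `while` loop is needed.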