Cumsum on investor and asset combination with a condition to restart in R

I have this data frame:

df <- structure(list(inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2"), 
ass = c("x", "x", "x", "y", "y", "x", "x", "x", "t", "t", "t"), 
datetime = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19", "2010-02-20", "2010-02-22", "2010-02-23", "2010-03-01", "2010-03-02", "2010-03-04"), 
portfolio = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2), 
G = c(1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)), 
class = "data.frame", row.names = c(NA, -11L))

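It represents investor transactions in a financial market; the full data has about 4k distinct investor IDs and 6k distinct assets. I am looking for a way to cumulatively sum the variable G for each investor*asset combination. In particular, I would like cumsum() to restart whenever a given investor*asset combination is paired with portfolio == 0.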

So for the data frame above I should get a new column called posdays, which should be equal to:

posdays = (1, 1, 0, 0, 0, 1, 2, 3, 1, 1, 1)

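The first 3 entries refer to INV_1*X (note that the count restarts on the third row because portfolio == 0 on the previous row), the fourth and fifth entries refer to INV_1*Y, then INV_2*X accumulates the G variable 3 times since portfolio > 0, and the last three refer to INV_2*T, where the count restarts again after the second entry since portfolio == 0.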

I tried a few things on my own but could not get what I am looking for. My code is:

res <- res %>%
  group_by(group = cumsum(dplyr::lag(portfolio == 0, default = 0))) %>%
  mutate(posdays = cumsum(G)) %>%
  select(-group) %>% 
  ungroup

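But this way I cannot distinguish between investors and assets as I need to. So basically I think I am looking for a way to add an investor*asset group_by specification to the code above, but I do not know how, since I have little experience as an R user.

Any ideas?

For anyone interested, I have tried the following approach; I am not sure whether it works.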

res <- res %>%
  group_by(investor, asset) %>% 
  mutate(group = cumsum(dplyr::lag(portfolio == 0, default = 0))) %>%
  group_by(investor, asset, group) %>% 
  mutate(posdays = cumsum(G)) %>%
  select(-group) %>% 
  ungroup

A cleaned-up version of the data frame from the question:

df <- structure(
  list(
    inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2"),
    ass = c("x", "x", "x", "y", "y", "x", "x", "x", "t", "t", "t"),
    datetime = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19", "2010-02-20", "2010-02-22", "2010-02-23", "2010-03-01", "2010-03-02", "2010-03-04"),
    G = c(1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
    portfolio = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2)
  ),
  class = "data.frame", row.names = c(NA, -11L)
)

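It looks like you had the right idea with your own code. The trick is to create a new column that marks each restart and then group by it. Before running cumsum, don't forget to make sure your data is sorted correctly.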
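To illustrate the trick in isolation, here is a minimal base R sketch for a single investor/asset pair. The vectors are made-up toy values (not taken from the question), and `restart` and `block` are just illustrative names:

```r
# Toy vectors for ONE investor/asset pair (hypothetical values)
portfolio <- c(10, 0, 2, 2, 0, 5)
G         <- c(1, 1, 1, 1, 1, 1)

# TRUE on the row right after a row where portfolio == 0
restart <- c(FALSE, head(portfolio == 0, -1))

# Running count of restarts gives a block id: 0 0 1 1 1 2
block <- cumsum(restart)

# Cumulative sum of G within each block: 1 2 1 2 3 1
pos_days <- ave(G, block, FUN = cumsum)

print(data.frame(portfolio, G, restart, block, pos_days))
```

The dplyr code that follows applies the same idea, with the lag and the block id computed separately for every investor and asset pair.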

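library(dplyr)

df_new <- df |> 
  # sort so that lag() and cumsum() follow time order within each pair
  arrange(inv, ass, datetime) |> 
  group_by(inv, ass) |> 
  mutate(
    # TRUE on the row after a portfolio == 0 row, within the same pair
    restart = lag(portfolio == 0, default = FALSE),
    # running count of restarts labels each block
    group = cumsum(restart)
  ) |> 
  group_by(inv, ass, group) |> 
  mutate(pos_days = cumsum(G)) |> 
  # drop the grouping so later operations are not silently grouped
  ungroup()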
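As a cross-check, the same logic can be sketched in base R with `ave()`, without dplyr. This is only one possible variant under the same restart rule (restart on the row after portfolio == 0); the `pair` key and its "|" separator are illustrative choices, not part of the original code:

```r
# Same data as the cleaned-up data frame above
df <- data.frame(
  inv = rep(c("INV_1", "INV_2"), times = c(5, 6)),
  ass = c("x", "x", "x", "y", "y", "x", "x", "x", "t", "t", "t"),
  datetime = as.Date(c("2010-01-01", "2010-01-02", "2010-01-03",
                       "2010-01-08", "2010-01-19", "2010-02-20",
                       "2010-02-22", "2010-02-23", "2010-03-01",
                       "2010-03-02", "2010-03-04")),
  G = c(1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  portfolio = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2)
)

# Sort so that lagging and cumsum() follow time order within each pair
df <- df[order(df$inv, df$ass, df$datetime), ]

# One key per investor/asset pair (separator is an arbitrary choice)
pair <- paste(df$inv, df$ass, sep = "|")

# 1 on the row after a portfolio == 0 row; the lag never crosses pairs
restart <- ave(as.numeric(df$portfolio == 0), pair,
               FUN = function(z) c(0, head(z, -1)))

# Block id within each pair, then cumulative G within pair and block
block <- ave(restart, pair, FUN = cumsum)
df$pos_days <- ave(df$G, paste(pair, block), FUN = cumsum)

print(df)
```

After sorting, the INV_2/t rows come before the INV_2/x rows, and pos_days is 1 1 0 0 0 1 2 1 0 1 2. For INV_2 this differs from the posdays vector listed in the question, so if that vector is the intended result, the restart rule or the G values may need adjusting.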