Cumsum on investor and asset combination with a condition to restart in R
I have this data frame:
df <- structure(list(inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2"),
ass = c("x", "x", "x", "y", "y", "x", "x", "x", "t", "t", "t"),
datetime = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19", "2010-02-20", "2010-02-22", "2010-02-23", "2010-03-01", "2010-03-02", "2010-03-04"),
portfolio = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2),
G = c(1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)),
class = "data.frame", row.names = c(NA, -11L))
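It represents investors' transactions on a financial market, so I have 4k different investor IDs and 6k different assets.
What I am looking for is a way to cumsum the variable G for each investor*asset combination.
In particular, I would like cumsum() to restart when the specific investor*asset combination is paired with portfolio == 0.
So in the data frame above I should obtain a new column named posdays, which should be equal to:
posdays = c(1, 1, 0, 0, 0, 1, 2, 3, 1, 1, 1)
where the first 3 entries refer to INV_1*X (notice that the third row restarts the count, since the previous row had portfolio == 0), the fourth and fifth entries refer to INV_1*Y, then INV_2*X accumulates the G variable 3 times since portfolio > 0, and the last three refer to INV_2*T, where the count restarts again after the second entry since portfolio == 0.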
I tried a few things on my own but could not get what I was looking for.
My code is:
res <- res %>%
group_by(group = cumsum(dplyr::lag(portfolio == 0, default = 0))) %>%
mutate(posdays = cumsum(G)) %>%
select(-group) %>%
ungroup
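But this way I cannot distinguish between investors and assets as I need to.
So basically I think I am looking for a way to add an investor*asset group_by specification to the code above, but I don't know how, since I have little experience as an R user.
Any ideas?
For anyone interested, I have tried the following approach, though I am not sure whether it works.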
res <- res %>%
group_by(investor, asset) %>%
mutate(group = cumsum(dplyr::lag(portfolio == 0, default = 0))) %>%
group_by(investor, asset, group) %>%
mutate(posdays = cumsum(G)) %>%
select(-group) %>%
ungroup
Some small corrections to the original data frame:
df <- structure(
list(
inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2"),
ass = c("x", "x", "x", "y", "y", "x", "x", "x", "t", "t", "t"),
datetime = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19", "2010-02-20", "2010-02-22", "2010-02-23", "2010-03-01", "2010-03-02", "2010-03-04"),
G = c(1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
portfolio = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2)
),
class = "data.frame", row.names = c(NA, -11L)
)
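It looks like you had the right idea with your own code. The trick is to create a new column and group by it. Don't forget to make sure your data is sorted correctly before performing the cumsum.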
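library(dplyr)
df_new <- df |>
# sort so that lag() and cumsum() see each investor/asset series in time order
arrange(inv, ass, datetime) |>
group_by(inv, ass) |>
mutate(
# TRUE on the row that follows a portfolio == 0 row within the same inv/ass
restart = lag(portfolio == 0, default = FALSE),
# running count of restarts labels each counting run
group = cumsum(restart)
) |>
# cumulate G separately within each run of each investor/asset pair
group_by(inv, ass, group) |>
mutate(pos_days = cumsum(G)) |>
ungroup()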
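For readers who want to see the restart rule without dplyr, here is a minimal base R sketch of the same idea. The helper name restart_cumsum and the loop over a pasted inv*ass key are illustrative assumptions, not part of the answer above. The data frame is the corrected sample from this answer.

```r
# Corrected sample data from the answer (11 rows).
df <- data.frame(
  inv = c(rep("INV_1", 5), rep("INV_2", 6)),
  ass = c("x", "x", "x", "y", "y", "x", "x", "x", "t", "t", "t"),
  datetime = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-08",
               "2010-01-19", "2010-02-20", "2010-02-22", "2010-02-23",
               "2010-03-01", "2010-03-02", "2010-03-04"),
  G = c(1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  portfolio = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: cumulative sum of g for ONE investor/asset series,
# restarting on the row that follows a portfolio == 0 row.
restart_cumsum <- function(g, portfolio) {
  # shift the zero-portfolio flag down by one row; the first row never restarts
  restart <- c(FALSE, head(portfolio == 0, -1))
  # each restart opens a new run; cumsum() turns the flags into run labels
  run_id <- cumsum(restart)
  # cumulate g within each run, keeping the original row order
  ave(g, run_id, FUN = cumsum)
}

# Sort first, so every series is in time order.
df <- df[order(df$inv, df$ass, df$datetime), ]

# Apply the helper to each investor*asset combination.
key <- paste(df$inv, df$ass, sep = "*")
df$pos_days <- NA_real_
for (k in unique(key)) {
  rows <- which(key == k)
  df$pos_days[rows] <- restart_cumsum(df$G[rows], df$portfolio[rows])
}

print(df[, c("inv", "ass", "datetime", "G", "portfolio", "pos_days")])
# pos_days in sorted order (INV_1*x, INV_1*y, INV_2*t, INV_2*x):
# 1 1 0 | 0 0 | 1 2 1 | 0 1 2
```

Note that the rows come out sorted by inv, ass and datetime, so INV_2*t appears before INV_2*x. The values follow G exactly as given in the sample data, so for the INV_2 rows they do not reproduce the posdays vector listed in the question.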