How to create an indicator function based on existence of past occurrences?
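I have a time-series panel dataset structured as follows: there are multiple funds, each fund holds multiple stocks, and there is a column for the stock value. As you can see, the panel is unbalanced. My actual dataset is very large, with each fund holding at least 500 stocks, and values are missing for some quarters.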
df <- data.frame(
fund_id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2),
stock_id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,3,3,3,3),
year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2011-12","2012-03","2012-06","2012-09",
"2012-12","2013-03","2013-06","2014-09","2015-03","2013-03","2013-06","2013-09","2013-12"),
value = c(1,2,1,3,4,2,1,2,3,4,2,1,3,1,1,3,2,3,1)
)
> df
fund_id stock_id year_q value
1 1 1 2011-03 1
2 1 1 2011-06 2
3 1 1 2011-09 1
4 1 1 2011-12 3
5 1 1 2012-03 4
6 1 1 2012-06 2
7 1 2 2011-12 1
8 1 2 2012-03 2
9 1 2 2012-06 3
10 1 2 2012-09 4
11 1 2 2012-12 2
12 1 2 2013-03 1
13 1 2 2013-06 3
14 2 1 2014-09 1
15 2 1 2015-03 1
16 2 3 2013-03 3
17 2 3 2013-06 2
18 2 3 2013-09 3
19 2 3 2013-12 1
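I want to create an indicator function that gives TRUE if a stock appears in that fund in the given quarter or in any of the previous 3 quarters.
This is the result I am looking for: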
result <- data.frame(
fund_id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2012-09","2012-12","2013-03","2013-06",
"2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2012-09","2012-12","2013-03","2013-06",
"2013-03","2013-06","2013-09","2013-12","2014-03","2014-06","2014-09","2014-12","2015-03","2013-03",
"2013-06","2013-09","2013-12","2014-03","2014-06","2014-09","2014-12","2015-03"),
stock_id = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,3,3,3,3,3,3,3,3,3),
Indicator = c(T,T,T,T,T,T,T,T,T,F,F,F,F,T,T,T,T,T,T,T,F,F,F,F,F,F,T,T,T,T,T,T,T,T,T,T,F,F)
)
fund_id year_q stock_id Indicator
1 1 2011-03 1 TRUE
2 1 2011-06 1 TRUE
3 1 2011-09 1 TRUE
4 1 2011-12 1 TRUE
5 1 2012-03 1 TRUE
6 1 2012-06 1 TRUE
7 1 2012-09 1 TRUE
8 1 2012-12 1 TRUE
9 1 2013-03 1 TRUE
10 1 2013-06 1 FALSE
11 1 2011-03 2 FALSE
12 1 2011-06 2 FALSE
13 1 2011-09 2 FALSE
14 1 2011-12 2 TRUE
15 1 2012-03 2 TRUE
16 1 2012-06 2 TRUE
17 1 2012-09 2 TRUE
18 1 2012-12 2 TRUE
19 1 2013-03 2 TRUE
20 1 2013-06 2 TRUE
21 2 2013-03 1 FALSE
22 2 2013-06 1 FALSE
23 2 2013-09 1 FALSE
24 2 2013-12 1 FALSE
25 2 2014-03 1 FALSE
26 2 2014-06 1 FALSE
27 2 2014-09 1 TRUE
28 2 2014-12 1 TRUE
29 2 2015-03 1 TRUE
30 2 2013-03 3 TRUE
31 2 2013-06 3 TRUE
32 2 2013-09 3 TRUE
33 2 2013-12 3 TRUE
34 2 2014-03 3 TRUE
35 2 2014-06 3 TRUE
36 2 2014-09 3 TRUE
37 2 2014-12 3 FALSE
38 2 2015-03 3 FALSE
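Note that in some cases the quarters are not consecutive and quarters may be missing. (If this is too difficult to handle, this condition can be ignored.)
Additionally, if a stock that existed in any of the previous three quarters no longer exists in the fund, I would also like to create a zero value for that quarter. (Although this is not essential.)
I have been trying solutions with multiple loops, but they perform poorly because the data is so large. My ideal solution would use dplyr or data.table.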
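I'll demonstrate using zoo::rollapplyr within a dplyr pipeline for "this quarter and the previous 3 quarters" (a window of width 4). We first need to "fill in" the quarters, which I'll do by summarizing, converting to Date, running seq by 3 months, merging back onto df, and then doing the rolling calculation.
Note that in some cases my "projection" goes further than your result. I don't know whether this is an inconsistency or just an indication that you were estimating what you want result to look like.
The complete code and results come first, then I'll walk through it step by step.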
library(dplyr)
# library(purrr) # map2
# library(tidyr) # unnest
out <- df %>%
group_by(fund_id) %>%
mutate(miny = min(year_q), maxy = max(year_q)) %>%
distinct(fund_id, stock_id, miny, maxy) %>%
group_by(fund_id, stock_id) %>%
mutate(across(c(miny, maxy), ~ as.Date(paste0(., "-01")))) %>%
transmute(year_q = purrr::map2(miny, maxy, ~ format(seq(.x, .y, by = "3 months"), format = "%Y-%m"))) %>%
tidyr::unnest(year_q) %>%
full_join(df, by = c("fund_id", "stock_id", "year_q")) %>%
arrange(fund_id, stock_id, year_q) %>% # only 'year_q' is strictly required, other 2 are aesthetic
mutate(
Indicator2 = zoo::rollapplyr(value, 4, FUN = function(z) any(!is.na(z)), partial = TRUE)
) %>%
ungroup() %>%
mutate(value = coalesce(value, 0))
out
# # A tibble: 38 x 5
# fund_id stock_id year_q value Indicator2
# <dbl> <dbl> <chr> <dbl> <lgl>
# 1 1 1 2011-03 1 TRUE
# 2 1 1 2011-06 2 TRUE
# 3 1 1 2011-09 1 TRUE
# 4 1 1 2011-12 3 TRUE
# 5 1 1 2012-03 4 TRUE
# 6 1 1 2012-06 2 TRUE
# 7 1 1 2012-09 0 TRUE
# 8 1 1 2012-12 0 TRUE
# 9 1 1 2013-03 0 TRUE
# 10 1 1 2013-06 0 FALSE
# # ... with 28 more rows
A quick validation; since the rows are in the same order, this works conveniently:
with(full_join(out, result, by = c("fund_id", "stock_id", "year_q")),
identical(Indicator, Indicator2))
# [1] TRUE
Walk-through:
Find the min/max year_q for each fund_id:
df %>%
group_by(fund_id) %>%
mutate(miny = min(year_q), maxy = max(year_q)) %>%
distinct(fund_id, stock_id, miny, maxy)
# # A tibble: 4 x 4
# # Groups: fund_id [2]
# fund_id stock_id miny maxy
# <dbl> <dbl> <chr> <chr>
# 1 1 1 2011-03 2013-06
# 2 1 2 2011-03 2013-06
# 3 2 1 2013-03 2015-03
# 4 2 3 2013-03 2015-03
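As a side note on why min() and max() can be applied to year_q directly: the values are zero-padded "YYYY-MM" strings, so sorting them as text gives the same order as sorting them as dates. A minimal check with made-up values (not taken from df):

```r
# Zero-padded "YYYY-MM" strings sort chronologically when compared as text.
quarters <- c("2012-03", "2011-12", "2013-06", "2011-03")
earliest <- min(quarters)
latest <- max(quarters)

# Converting to Date (first day of the month) gives the same ordering.
as_dates <- as.Date(paste0(quarters, "-01"))
same_order <- identical(order(quarters), order(as_dates))
```

This would not hold for unpadded months such as "2011-3", so the format is an assumption worth checking on real data.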
"Fill in" so that each stock_id covers the full time-frame of its fund_id:
... %>%
group_by(fund_id, stock_id) %>%
mutate(across(c(miny, maxy), ~ as.Date(paste0(., "-01")))) %>%
transmute(year_q = purrr::map2(miny, maxy, ~ format(seq(.x, .y, by = "3 months"), format = "%Y-%m"))) %>%
tidyr::unnest(year_q)
# # A tibble: 38 x 3
# # Groups: fund_id, stock_id [4]
# fund_id stock_id year_q
# <dbl> <dbl> <chr>
# 1 1 1 2011-03
# 2 1 1 2011-06
# 3 1 1 2011-09
# 4 1 1 2011-12
# 5 1 1 2012-03
# 6 1 1 2012-06
# 7 1 1 2012-09
# 8 1 1 2012-12
# 9 1 1 2013-03
# 10 1 1 2013-06
# # ... with 28 more rows
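The core of this fill-in step is seq() on Date values with by = "3 months". A small standalone sketch for one hypothetical fund whose first and last quarters are 2011-03 and 2012-03:

```r
# Turn the "YYYY-MM" endpoints into Dates by appending a day of the month.
first_q <- as.Date(paste0("2011-03", "-01"))
last_q <- as.Date(paste0("2012-03", "-01"))

# Step through the range 3 months at a time, then format back to "YYYY-MM".
all_q <- format(seq(first_q, last_q, by = "3 months"), format = "%Y-%m")
```

The grid is anchored on the first quarter, so this assumes every observed quarter falls a multiple of 3 months after it (03, 06, 09, 12 in the sample data). Quarters recorded on other months would not line up with the generated grid.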
Join the original data back in. This shows us the missing quarters.
... %>%
full_join(df, by = c("fund_id", "stock_id", "year_q")) %>%
arrange(fund_id, stock_id, year_q)
# # A tibble: 38 x 4
# # Groups: fund_id, stock_id [4]
# fund_id stock_id year_q value
# <dbl> <dbl> <chr> <dbl>
# 1 1 1 2011-03 1
# 2 1 1 2011-06 2
# 3 1 1 2011-09 1
# 4 1 1 2011-12 3
# 5 1 1 2012-03 4
# 6 1 1 2012-06 2
# 7 1 1 2012-09 NA
# 8 1 1 2012-12 NA
# 9 1 1 2013-03 NA
# 10 1 1 2013-06 NA
# # ... with 28 more rows
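To see in isolation how joining the full quarter grid exposes the gaps, here is a small sketch in which base R's merge() stands in for dplyr::full_join(). The two data frames are made up for illustration:

```r
# Every quarter we want a row for.
grid <- data.frame(year_q = c("2012-03", "2012-06", "2012-09", "2012-12"),
                   stringsAsFactors = FALSE)
# Quarters that were actually observed.
observed <- data.frame(year_q = c("2012-03", "2012-06"),
                       value = c(4, 2),
                       stringsAsFactors = FALSE)

# all = TRUE keeps rows from both sides; unmatched quarters get value = NA.
joined <- merge(grid, observed, by = "year_q", all = TRUE)
```

The NA values are what the rolling step later treats as "stock not present in this quarter".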
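Perform the "rolling" calculation. This is most of the "work": for each value, we look at it and the previous 3 values (the window of 4 minus the current one gives us 3 previous values) and determine whether any are non-NA. partial=TRUE means that for the first value (which has no previous values) we look at it alone; for the second value, we look at it and the one before it; and so on. With partial=TRUE, the return is the same length as value; with partial=FALSE (the default), it would return length(value) - (4-1) values, which is not what we want.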
... %>%
# only 'year_q' is strictly required, other 2 are aesthetic
mutate(
Indicator2 = zoo::rollapplyr(value, 4, FUN = function(z) any(!is.na(z)), partial = TRUE)
)
# # A tibble: 38 x 5
# # Groups: fund_id, stock_id [4]
# fund_id stock_id year_q value Indicator2
# <dbl> <dbl> <chr> <dbl> <lgl>
# 1 1 1 2011-03 1 TRUE
# 2 1 1 2011-06 2 TRUE
# 3 1 1 2011-09 1 TRUE
# 4 1 1 2011-12 3 TRUE
# 5 1 1 2012-03 4 TRUE
# 6 1 1 2012-06 2 TRUE
# 7 1 1 2012-09 NA TRUE
# 8 1 1 2012-12 NA TRUE
# 9 1 1 2013-03 NA TRUE
# 10 1 1 2013-06 NA FALSE
# # ... with 28 more rows
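The effect of partial on the output length can be checked on a short made-up vector. This is only a sketch; x and has_value are illustrative names, not part of the pipeline above:

```r
library(zoo)

x <- c(1, NA, NA, NA, NA, 2)
has_value <- function(z) any(!is.na(z))

# partial = TRUE: early windows are truncated, so every element gets a result.
with_partial <- rollapplyr(x, 4, FUN = has_value, partial = TRUE)

# Default (partial = FALSE): only complete windows of width 4 are evaluated.
without_partial <- rollapplyr(x, 4, FUN = has_value)

length(with_partial)    # 6, same as length(x)
length(without_partial) # 3, i.e. length(x) - (4 - 1)
```

In the partial = TRUE result, only the fifth element is FALSE, because its window (positions 2 to 5) contains nothing but NA.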
Finally, as a tidy-up (for convenience), convert every value where is.na(value) is TRUE to 0.
data.table
library(data.table)
DT <- as.data.table(df)
year_qs <- unique(  # data.table's unique() method, so this block does not depend on dplyr::distinct()
DT[, c("miny", "maxy") := .(min(year_q), max(year_q)), by = .(fund_id)
][, .(fund_id, stock_id, miny, maxy)]
)[, c("miny","maxy") := lapply(.SD, function(z) as.Date(paste0(z, "-01"))), .SDcols = c("miny","maxy")
][, .(year_q = Map(function(a, b) format(seq.Date(a, b, by = "3 months"), format = "%Y-%m"), miny, maxy)),
by = .(fund_id, stock_id)
][, tidyr::unnest(.SD, year_q) ]
setDT(year_qs)
DT[, c("miny", "maxy") := NULL]
DT <- rbindlist(list(
DT,
year_qs[!DT, on = .(fund_id, stock_id, year_q)]),
use.names = TRUE, fill = TRUE)
setorder(DT, fund_id, stock_id, year_q)
DT[, Indicator := zoo::rollapplyr(value, 4, FUN = function(z) any(!is.na(z)), partial = TRUE),
by = .(fund_id, stock_id)
][, value := fcoalesce(value, 0)]
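Since the question mentions a very large dataset, a vectorized alternative to calling a function once per window may be worth benchmarking. The sketch below is not part of the answer above: present_in_window() is a hypothetical base R helper that computes the same "any non-NA in the current and previous 3 rows" indicator from cumulative sums. It assumes the rows of each group are already sorted by quarter and that missing quarters have already been filled in as NA.

```r
# Hypothetical helper (not part of zoo, dplyr, or data.table).
present_in_window <- function(value, width = 4) {
  present <- as.integer(!is.na(value))
  running <- cumsum(present)
  # Non-NA count up to the row just before the window starts
  # (0 for the first `width` rows, where the window is truncated).
  before_window <- c(rep(0L, width), running)[seq_along(running)]
  # TRUE when at least one non-NA value falls inside the window.
  (running - before_window) > 0
}

flags <- present_in_window(c(1, NA, NA, NA, NA, 2))
```

If it behaves as expected on real data, it could presumably replace the zoo::rollapplyr() call inside the grouped step, for example as Indicator := present_in_window(value) with the same by = .(fund_id, stock_id), applied before value is coalesced to 0.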