How to create an indicator function based on existence of past occurrences?

Question

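I have a time-series panel dataset structured as follows: there are multiple funds, each fund holds multiple stocks, and there is a column with the stock value. As you can see, the panel is unbalanced. My actual dataset is very large: each fund holds at least 500 stocks, and values are missing for some quarters.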

df <- data.frame(
  fund_id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2),
  stock_id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,3,3,3,3),
  year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2011-12","2012-03","2012-06","2012-09",
           "2012-12","2013-03","2013-06","2014-09","2015-03","2013-03","2013-06","2013-09","2013-12"),
  value = c(1,2,1,3,4,2,1,2,3,4,2,1,3,1,1,3,2,3,1)
)


> df
   fund_id stock_id  year_q value
1        1        1 2011-03     1
2        1        1 2011-06     2
3        1        1 2011-09     1
4        1        1 2011-12     3
5        1        1 2012-03     4
6        1        1 2012-06     2
7        1        2 2011-12     1
8        1        2 2012-03     2
9        1        2 2012-06     3
10       1        2 2012-09     4
11       1        2 2012-12     2
12       1        2 2013-03     1
13       1        2 2013-06     3
14       2        1 2014-09     1
15       2        1 2015-03     1
16       2        3 2013-03     3
17       2        3 2013-06     2
18       2        3 2013-09     3
19       2        3 2013-12     1

I want to create an indicator that is TRUE if a stock was held by the fund in that quarter or in any of the previous 3 quarters. This is the result I am looking for:

result <- data.frame(
  fund_id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
  year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2012-09","2012-12","2013-03","2013-06",
             "2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2012-09","2012-12","2013-03","2013-06",
             "2013-03","2013-06","2013-09","2013-12","2014-03","2014-06","2014-09","2014-12","2015-03","2013-03",
             "2013-06","2013-09","2013-12","2014-03","2014-06","2014-09","2014-12","2015-03"),
  stock_id = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,3,3,3,3,3,3,3,3,3),
  Indicator = c(T,T,T,T,T,T,T,T,T,F,F,F,F,T,T,T,T,T,T,T,F,F,F,F,F,F,T,T,T,T,T,T,T,T,T,T,F,F)
)

   fund_id  year_q stock_id Indicator
1        1 2011-03        1      TRUE
2        1 2011-06        1      TRUE
3        1 2011-09        1      TRUE
4        1 2011-12        1      TRUE
5        1 2012-03        1      TRUE
6        1 2012-06        1      TRUE
7        1 2012-09        1      TRUE
8        1 2012-12        1      TRUE
9        1 2013-03        1      TRUE
10       1 2013-06        1     FALSE
11       1 2011-03        2     FALSE
12       1 2011-06        2     FALSE
13       1 2011-09        2     FALSE
14       1 2011-12        2      TRUE
15       1 2012-03        2      TRUE
16       1 2012-06        2      TRUE
17       1 2012-09        2      TRUE
18       1 2012-12        2      TRUE
19       1 2013-03        2      TRUE
20       1 2013-06        2      TRUE
21       2 2013-03        1     FALSE
22       2 2013-06        1     FALSE
23       2 2013-09        1     FALSE
24       2 2013-12        1     FALSE
25       2 2014-03        1     FALSE
26       2 2014-06        1     FALSE
27       2 2014-09        1      TRUE
28       2 2014-12        1      TRUE
29       2 2015-03        1      TRUE
30       2 2013-03        3      TRUE
31       2 2013-06        3      TRUE
32       2 2013-09        3      TRUE
33       2 2013-12        3      TRUE
34       2 2014-03        3      TRUE
35       2 2014-06        3      TRUE
36       2 2014-09        3      TRUE
37       2 2014-12        3     FALSE
38       2 2015-03        3     FALSE

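Note that in some cases the quarters are not consecutive, so some quarters may be missing. (If this is too hard to handle, this condition can be ignored.)

Additionally, if a stock that was present in any of the previous three quarters is no longer held by the fund, I would also like to create a zero value for that quarter (although this is not essential). I have been trying multi-loop solutions, but they perform poorly because of the size of the data. My ideal solution would use dplyr or data.table.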

Answer 1

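I'll demonstrate using zoo::rollapplyr within a dplyr pipe for "this quarter and the previous 3" (a window of width 4). We first need to "fill in" the quarters, which I'll do by summarizing, converting to Date, generating a sequence with seq in steps of 3 months, and merging back into df. The rolling calculation comes after that.

Note that in some cases my "prediction" goes a little further than the one in your result. I don't know whether this is an inconsistency or just an indication that you were estimating what you wanted result to look like.

The full code and results come first, then I'll walk through it step by step.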

library(dplyr)
# library(purrr) # map2
# library(tidyr) # unnest
out <- df %>%
  group_by(fund_id) %>%
  mutate(miny = min(year_q), maxy = max(year_q)) %>%
  distinct(fund_id, stock_id, miny, maxy) %>%
  group_by(fund_id, stock_id) %>%
  mutate(across(c(miny, maxy), ~ as.Date(paste0(., "-01")))) %>%
  transmute(year_q = purrr::map2(miny, maxy, ~ format(seq(.x, .y, by = "3 months"), format = "%Y-%m")))  %>%
  tidyr::unnest(year_q) %>%
  full_join(df, by = c("fund_id", "stock_id", "year_q")) %>%
  arrange(fund_id, stock_id, year_q) %>% # only 'year_q' is strictly required, other 2 are aesthetic
  mutate(
    Indicator2 = zoo::rollapplyr(value, 4, FUN = function(z) any(!is.na(z)), partial = TRUE)
  ) %>%
  ungroup() %>%
  mutate(value = coalesce(value, 0))

out
# # A tibble: 38 x 5
#    fund_id stock_id year_q  value Indicator2
#      <dbl>    <dbl> <chr>   <dbl> <lgl>     
#  1       1        1 2011-03     1 TRUE      
#  2       1        1 2011-06     2 TRUE      
#  3       1        1 2011-09     1 TRUE      
#  4       1        1 2011-12     3 TRUE      
#  5       1        1 2012-03     4 TRUE      
#  6       1        1 2012-06     2 TRUE      
#  7       1        1 2012-09     0 TRUE      
#  8       1        1 2012-12     0 TRUE      
#  9       1        1 2013-03     0 TRUE      
# 10       1        1 2013-06     0 FALSE     
# # ... with 28 more rows

A quick validation, which works conveniently because the row order is the same:

with(full_join(out, result, by = c("fund_id", "stock_id", "year_q")),
     identical(Indicator, Indicator2))
# [1] TRUE

Walk-through:

Find the min/max year_q for each fund_id:

df %>%
  group_by(fund_id) %>%
  mutate(miny = min(year_q), maxy = max(year_q)) %>%
  distinct(fund_id, stock_id, miny, maxy)
# # A tibble: 4 x 4
# # Groups:   fund_id [2]
#   fund_id stock_id miny    maxy   
#     <dbl>    <dbl> <chr>   <chr>  
# 1       1        1 2011-03 2013-06
# 2       1        2 2011-03 2013-06
# 3       2        1 2013-03 2015-03
# 4       2        3 2013-03 2015-03
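Taking min() and max() directly on year_q works here because the values are zero-padded "YYYY-MM" strings, and such strings sort in chronological order. A small check on made-up labels (the vector q is illustrative, not taken from df):

```r
# Hypothetical quarter labels, deliberately out of order
q <- c("2014-09", "2015-03", "2013-03", "2013-12")

# min/max on character vectors compare strings; with zero-padded
# YYYY-MM labels this coincides with the earliest and latest quarter
earliest <- min(q)
latest   <- max(q)

# Converting to Date first leads to the same earliest quarter
earliest_by_date <- format(min(as.Date(paste0(q, "-01"))), "%Y-%m")

print(c(earliest, latest, earliest_by_date))
```

If the labels were not zero-padded (for example "2013-3"), this shortcut would no longer be safe and converting to Date first would be the more cautious route.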

"Fill in" so that each stock_id covers the full time-frame of its fund_id:

... %>%
  group_by(fund_id, stock_id) %>%
  mutate(across(c(miny, maxy), ~ as.Date(paste0(., "-01")))) %>%
  transmute(year_q = purrr::map2(miny, maxy, ~ format(seq(.x, .y, by = "3 months"), format = "%Y-%m")))  %>%
  tidyr::unnest(year_q)
# # A tibble: 38 x 3
# # Groups:   fund_id, stock_id [4]
#    fund_id stock_id year_q 
#      <dbl>    <dbl> <chr>  
#  1       1        1 2011-03
#  2       1        1 2011-06
#  3       1        1 2011-09
#  4       1        1 2011-12
#  5       1        1 2012-03
#  6       1        1 2012-06
#  7       1        1 2012-09
#  8       1        1 2012-12
#  9       1        1 2013-03
# 10       1        1 2013-06
# # ... with 28 more rows
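To see the grid-building step in isolation, here is a minimal sketch for a single min/max pair (the values are those of fund 2 in the sample data). It assumes, as in the sample, that every quarter label falls on months 03, 06, 09 or 12; a fund whose first observation sat off that cycle would produce a grid that does not line up with its other rows.

```r
# Turn the "YYYY-MM" labels into real dates by appending a day
start <- as.Date(paste0("2013-03", "-01"))
end   <- as.Date(paste0("2015-03", "-01"))

# Step through the range one quarter at a time, then go back to "YYYY-MM"
quarters <- format(seq(start, end, by = "3 months"), format = "%Y-%m")

print(quarters)
print(length(quarters))
```

For this range the sequence contains 9 quarters, which is why each stock of fund 2 ends up with 9 rows in the filled-in table.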

Join back with the original data. This shows us the missing quarters.

... %>%
  full_join(df, by = c("fund_id", "stock_id", "year_q")) %>%
  arrange(fund_id, stock_id, year_q)
# # A tibble: 38 x 4
# # Groups:   fund_id, stock_id [4]
#    fund_id stock_id year_q  value
#      <dbl>    <dbl> <chr>   <dbl>
#  1       1        1 2011-03     1
#  2       1        1 2011-06     2
#  3       1        1 2011-09     1
#  4       1        1 2011-12     3
#  5       1        1 2012-03     4
#  6       1        1 2012-06     2
#  7       1        1 2012-09    NA
#  8       1        1 2012-12    NA
#  9       1        1 2013-03    NA
# 10       1        1 2013-06    NA
# # ... with 28 more rows
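The effect of the join can be reproduced on a tiny example. This sketch uses base merge() rather than dplyr's full_join(), purely to keep it dependency-free; the data frames grid and obs are made up for illustration.

```r
# Complete quarter grid for one (fund, stock) pair
grid <- data.frame(year_q = c("2014-09", "2014-12", "2015-03"))

# Quarters actually observed; "2014-12" is absent
obs <- data.frame(year_q = c("2014-09", "2015-03"), value = c(1, 1))

# all = TRUE keeps rows from both sides, like a full join;
# quarters without an observation get value = NA
joined <- merge(grid, obs, by = "year_q", all = TRUE)

# The NA rows are exactly the quarters missing from the original data
missing_q <- joined$year_q[is.na(joined$value)]

print(joined)
print(missing_q)
```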

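Perform the "rolling" calculation. This is where most of the work happens: for each value, we look at it and the previous 3 values (the window of 4 minus the current one leaves 3 previous values) and determine whether any of them is non-NA. partial = TRUE means that for the first value (which has no predecessors) we look at it alone; for the second value, we look at it and the one before it; and so on. With partial = TRUE, the return value has the same length as value; with partial = FALSE (the default), it would return length(value) - (4 - 1) values, which is not what we want.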

... %>%
  # width 4 = current quarter plus the previous 3; partial = TRUE keeps the output as long as the input
  mutate(
    Indicator2 = zoo::rollapplyr(value, 4, FUN = function(z) any(!is.na(z)), partial = TRUE)
  )
# # A tibble: 38 x 5
# # Groups:   fund_id, stock_id [4]
#    fund_id stock_id year_q  value Indicator2
#      <dbl>    <dbl> <chr>   <dbl> <lgl>     
#  1       1        1 2011-03     1 TRUE      
#  2       1        1 2011-06     2 TRUE      
#  3       1        1 2011-09     1 TRUE      
#  4       1        1 2011-12     3 TRUE      
#  5       1        1 2012-03     4 TRUE      
#  6       1        1 2012-06     2 TRUE      
#  7       1        1 2012-09    NA TRUE      
#  8       1        1 2012-12    NA TRUE      
#  9       1        1 2013-03    NA TRUE      
# 10       1        1 2013-06    NA FALSE     
# # ... with 28 more rows
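To make the window logic concrete, here is a base-R sketch of what the rolling step computes. The helper rolling_any_present is hypothetical (it is not part of zoo); it is only meant to mirror a right-aligned window of width 4 with shortened windows at the start, which is how I understand partial = TRUE to behave.

```r
# Hypothetical helper: right-aligned rolling "is anything non-NA?"
rolling_any_present <- function(x, width = 4L) {
  vapply(seq_along(x), function(i) {
    # Window ends at position i; it is shorter than `width` near the start
    window <- x[max(1L, i - width + 1L):i]
    any(!is.na(window))
  }, logical(1))
}

# One observed value, four missing quarters, then the stock reappears
x <- c(1, NA, NA, NA, NA, 2)
res <- rolling_any_present(x)

print(res)
# TRUE TRUE TRUE TRUE FALSE TRUE
```

The fifth position is FALSE because its window covers positions 2 to 5, all of which are NA; the first observation has dropped out of the window by then.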

Finally, clean up by (for convenience) converting every value for which is.na(value) is true to 0.
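This last step corresponds to the ungroup() and coalesce() calls at the end of the full pipeline. A minimal sketch on a toy table (toy is made up and stands in for the pipeline's intermediate result):

```r
library(dplyr)

# Toy grouped data: the stock is held once, then absent for two quarters
toy <- tibble(
  stock_id   = c(1, 1, 1),
  value      = c(2, NA, NA),
  Indicator2 = c(TRUE, TRUE, TRUE)
) %>%
  group_by(stock_id)

cleaned <- toy %>%
  ungroup() %>%                       # drop the grouping left over from earlier steps
  mutate(value = coalesce(value, 0))  # NA (stock absent that quarter) becomes 0

print(cleaned)
```

The rows where Indicator2 is TRUE but value is 0 are the quarters in which the stock was recently held but is no longer present, which is the zero value asked for in the question.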

data.table

library(data.table)
DT <- as.data.table(df)
# unique() is data.table's own method, so this block does not rely on dplyr being loaded
year_qs <- unique(
  DT[, c("miny", "maxy") := .(min(year_q), max(year_q)), by = .(fund_id)
     ][, .(fund_id, stock_id, miny, maxy)]
)[, c("miny","maxy") := lapply(.SD, function(z) as.Date(paste0(z, "-01"))), .SDcols = c("miny","maxy")
  ][, .(year_q = Map(function(a, b) format(seq.Date(a, b, by = "3 months"), format = "%Y-%m"), miny, maxy)),
    by = .(fund_id, stock_id)
    ][, tidyr::unnest(.SD, year_q) ]
setDT(year_qs)
DT[, c("miny", "maxy") := NULL]

DT <- rbindlist(list(
  DT,
  year_qs[!DT, on = .(fund_id, stock_id, year_q)]),
  use.names = TRUE, fill = TRUE)
setorder(DT, fund_id, stock_id, year_q)

DT[, Indicator := zoo::rollapplyr(value, 4, FUN = function(z) any(!is.na(z)), partial = TRUE),
   by = .(fund_id, stock_id)
   ][, value := fcoalesce(value, 0)]
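Since the real dataset is large, calling an R function once per window may become slow. One possible alternative, sketched here as an assumption rather than as part of the answer above, is to count non-NA values with a cumulative sum and compare it with the count 4 positions earlier. The helper rolling_any_cumsum is hypothetical and has only been checked on the toy vector shown.

```r
# Hypothetical helper: vectorized "any non-NA in the last `width` values"
rolling_any_cumsum <- function(value, width = 4L) {
  present <- !is.na(value)
  cs <- cumsum(present)          # running count of observed quarters
  n <- length(cs)
  # Running count as of `width` positions earlier (0 before the series starts)
  lagged <- c(rep(0L, min(width, n)), head(cs, max(n - width, 0L)))
  (cs - lagged) > 0              # at least one observation inside the window
}

x <- c(1, NA, NA, NA, NA, 2)
res_cs <- rolling_any_cumsum(x)

print(res_cs)
# TRUE TRUE TRUE TRUE FALSE TRUE
```

If it holds up on your data, it could be used in place of the zoo::rollapplyr call inside the same by = .(fund_id, stock_id) grouping.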
