How to create an indicator function based on existence of past occurrences?

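I have a time-series panel dataset structured as follows: there are multiple funds, each fund holds multiple stocks, and there is a column with the stock value. As you can see, the panel is unbalanced. My actual dataset is very large: each fund holds at least 500 stocks, and some quarterly values are missing.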

df <- data.frame(
  fund_id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2),
  stock_id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,3,3,3,3),
  year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2011-12","2012-03","2012-06","2012-09",
           "2012-12","2013-03","2013-06","2014-09","2015-03","2013-03","2013-06","2013-09","2013-12"),
  value = c(1,2,1,3,4,2,1,2,3,4,2,1,3,1,1,3,2,3,1)
)


> df
   fund_id stock_id  year_q value
1        1        1 2011-03     1
2        1        1 2011-06     2
3        1        1 2011-09     1
4        1        1 2011-12     3
5        1        1 2012-03     4
6        1        1 2012-06     2
7        1        2 2011-12     1
8        1        2 2012-03     2
9        1        2 2012-06     3
10       1        2 2012-09     4
11       1        2 2012-12     2
12       1        2 2013-03     1
13       1        2 2013-06     3
14       2        1 2014-09     1
15       2        1 2015-03     1
16       2        3 2013-03     3
17       2        3 2013-06     2
18       2        3 2013-09     3
19       2        3 2013-12     1

I want to create an indicator that is TRUE if a stock is held by the fund in that quarter or in any of the previous 3 quarters. This is the result I am looking for:

result <- data.frame(
  fund_id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
  year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2012-09","2012-12","2013-03","2013-06",
             "2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2012-09","2012-12","2013-03","2013-06",
             "2013-03","2013-06","2013-09","2013-12","2014-03","2014-06","2014-09","2014-12","2015-03","2013-03",
             "2013-06","2013-09","2013-12","2014-03","2014-06","2014-09","2014-12","2015-03"),
  stock_id = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,3,3,3,3,3,3,3,3,3),
  Indicator = c(T,T,T,T,T,T,T,T,T,F,F,F,F,T,T,T,T,T,T,T,F,F,F,F,F,F,T,T,T,T,T,T,T,T,T,T,F,F)
)

   fund_id  year_q stock_id Indicator
1        1 2011-03        1      TRUE
2        1 2011-06        1      TRUE
3        1 2011-09        1      TRUE
4        1 2011-12        1      TRUE
5        1 2012-03        1      TRUE
6        1 2012-06        1      TRUE
7        1 2012-09        1      TRUE
8        1 2012-12        1      TRUE
9        1 2013-03        1      TRUE
10       1 2013-06        1     FALSE
11       1 2011-03        2     FALSE
12       1 2011-06        2     FALSE
13       1 2011-09        2     FALSE
14       1 2011-12        2      TRUE
15       1 2012-03        2      TRUE
16       1 2012-06        2      TRUE
17       1 2012-09        2      TRUE
18       1 2012-12        2      TRUE
19       1 2013-03        2      TRUE
20       1 2013-06        2      TRUE
21       2 2013-03        1     FALSE
22       2 2013-06        1     FALSE
23       2 2013-09        1     FALSE
24       2 2013-12        1     FALSE
25       2 2014-03        1     FALSE
26       2 2014-06        1     FALSE
27       2 2014-09        1      TRUE
28       2 2014-12        1      TRUE
29       2 2015-03        1      TRUE
30       2 2013-03        3      TRUE
31       2 2013-06        3      TRUE
32       2 2013-09        3      TRUE
33       2 2013-12        3      TRUE
34       2 2014-03        3      TRUE
35       2 2014-06        3      TRUE
36       2 2014-09        3      TRUE
37       2 2014-12        3     FALSE
38       2 2015-03        3     FALSE

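Note that in some cases the quarters are not consecutive, so some quarters may be missing. (If this is too hard to handle, this condition can be ignored.)

In addition, if a stock that was present in any of the previous three quarters is no longer in the fund, I would like to create a zero value for that quarter. (This is not essential, though.) I have been trying solutions with nested loops, but they perform poorly because of the size of the data. My ideal solution would use dplyr or data.table.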

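I'll demonstrate using zoo::rollapplyr inside a dplyr pipeline for "this quarter and the previous 3 quarters" (a window of width 4). We first need to "fill out" the quarters, which I'll do by summarizing, converting to Date, and calling seq in 3-month steps; then we merge back onto df and do the rolling calculation.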
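To make the fill-out idea concrete before the full pipeline, here is a minimal base-R sketch for a single fund/stock pair. The values are copied from fund 2, stock 1 in the sample data; the names quarters, observed, and missing_q are made up for this illustration, and it assumes every quarter label falls on the same 3-month grid.

```r
# First and last quarter seen for fund 2 in the sample data
miny <- "2013-03"
maxy <- "2015-03"

# Append a day so the strings parse as Dates, then step in 3-month increments
quarters <- format(
  seq(as.Date(paste0(miny, "-01")), as.Date(paste0(maxy, "-01")), by = "3 months"),
  format = "%Y-%m"
)

# Quarters actually observed for stock 1 in that fund
observed <- c("2014-09", "2015-03")

# Quarters the fill-out step has to add (they end up with value = NA)
missing_q <- setdiff(quarters, observed)

quarters
missing_q
```

Because the labels are zero-padded "YYYY-MM" strings, min() and max() on the character column give the chronological first and last quarter, which is what the pipeline relies on.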

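Note that, compared with your result, there are some cases where my "prediction" goes a little further. I don't know whether that is an inconsistency or just a sign that you were estimating what your desired result would look like.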

The complete code and results come first; after that I'll walk through it step by step.

library(dplyr)
# library(purrr) # map2
# library(tidyr) # unnest
out <- df %>%
  group_by(fund_id) %>%
  mutate(miny = min(year_q), maxy = max(year_q)) %>%
  distinct(fund_id, stock_id, miny, maxy) %>%
  group_by(fund_id, stock_id) %>%
  mutate(across(c(miny, maxy), ~ as.Date(paste0(., "-01")))) %>%
  transmute(year_q = purrr::map2(miny, maxy, ~ format(seq(.x, .y, by = "3 months"), format = "%Y-%m")))  %>%
  tidyr::unnest(year_q) %>%
  full_join(df, by = c("fund_id", "stock_id", "year_q")) %>%
  arrange(fund_id, stock_id, year_q) %>% # only 'year_q' is strictly required, other 2 are aesthetic
  mutate(
    Indicator2 = zoo::rollapplyr(value, 4, FUN = function(z) any(!is.na(z)), partial = TRUE)
  ) %>%
  ungroup() %>%
  mutate(value = coalesce(value, 0))

out
# # A tibble: 38 x 5
#    fund_id stock_id year_q  value Indicator2
#      <dbl>    <dbl> <chr>   <dbl> <lgl>     
#  1       1        1 2011-03     1 TRUE      
#  2       1        1 2011-06     2 TRUE      
#  3       1        1 2011-09     1 TRUE      
#  4       1        1 2011-12     3 TRUE      
#  5       1        1 2012-03     4 TRUE      
#  6       1        1 2012-06     2 TRUE      
#  7       1        1 2012-09     0 TRUE      
#  8       1        1 2012-12     0 TRUE      
#  9       1        1 2013-03     0 TRUE      
# 10       1        1 2013-06     0 FALSE     
# # ... with 28 more rows

A quick validation, which works out conveniently because the row order is the same:

with(full_join(out, result, by = c("fund_id", "stock_id", "year_q")),
     identical(Indicator, Indicator2))
# [1] TRUE

Walk-through:

  1. Find the min/max year_q for each fund_id:

    df %>%
      group_by(fund_id) %>%
      mutate(miny = min(year_q), maxy = max(year_q)) %>%
      distinct(fund_id, stock_id, miny, maxy)
    # # A tibble: 4 x 4
    # # Groups:   fund_id [2]
    #   fund_id stock_id miny    maxy   
    #     <dbl>    <dbl> <chr>   <chr>  
    # 1       1        1 2011-03 2013-06
    # 2       1        2 2011-03 2013-06
    # 3       2        1 2013-03 2015-03
    # 4       2        3 2013-03 2015-03
    
  2. "Fill out" the quarters so that each stock_id covers the full time frame of its fund_id:

    ... %>%
      group_by(fund_id, stock_id) %>%
      mutate(across(c(miny, maxy), ~ as.Date(paste0(., "-01")))) %>%
      transmute(year_q = purrr::map2(miny, maxy, ~ format(seq(.x, .y, by = "3 months"), format = "%Y-%m")))  %>%
      tidyr::unnest(year_q)
    # # A tibble: 38 x 3
    # # Groups:   fund_id, stock_id [4]
    #    fund_id stock_id year_q 
    #      <dbl>    <dbl> <chr>  
    #  1       1        1 2011-03
    #  2       1        1 2011-06
    #  3       1        1 2011-09
    #  4       1        1 2011-12
    #  5       1        1 2012-03
    #  6       1        1 2012-06
    #  7       1        1 2012-09
    #  8       1        1 2012-12
    #  9       1        1 2013-03
    # 10       1        1 2013-06
    # # ... with 28 more rows
    
  3. Join the original data back in. This shows us the missing quarters.

    ... %>%
      full_join(df, by = c("fund_id", "stock_id", "year_q")) %>%
      arrange(fund_id, stock_id, year_q)
    # # A tibble: 38 x 4
    # # Groups:   fund_id, stock_id [4]
    #    fund_id stock_id year_q  value
    #      <dbl>    <dbl> <chr>   <dbl>
    #  1       1        1 2011-03     1
    #  2       1        1 2011-06     2
    #  3       1        1 2011-09     1
    #  4       1        1 2011-12     3
    #  5       1        1 2012-03     4
    #  6       1        1 2012-06     2
    #  7       1        1 2012-09    NA
    #  8       1        1 2012-12    NA
    #  9       1        1 2013-03    NA
    # 10       1        1 2013-06    NA
    # # ... with 28 more rows
    
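  4. Perform the "rolling" calculation. This is most of the "work": for each value, we look at it and the previous 3 values (a window of 4 minus the current one gives us 3 previous values) and determine whether any of them is non-NA. partial=TRUE means that for the first value (which has no predecessors) we look at it alone; for the second value we look at it and the one before it; and so on. With partial=TRUE, the return value has the same length as value; with partial=FALSE (the default), it would return length(value) - (4-1) values, which is not what we want.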

    ... %>%
      mutate(
        Indicator2 = zoo::rollapplyr(value, 4, FUN = function(z) any(!is.na(z)), partial = TRUE)
      )
    # # A tibble: 38 x 5
    # # Groups:   fund_id, stock_id [4]
    #    fund_id stock_id year_q  value Indicator2
    #      <dbl>    <dbl> <chr>   <dbl> <lgl>     
    #  1       1        1 2011-03     1 TRUE      
    #  2       1        1 2011-06     2 TRUE      
    #  3       1        1 2011-09     1 TRUE      
    #  4       1        1 2011-12     3 TRUE      
    #  5       1        1 2012-03     4 TRUE      
    #  6       1        1 2012-06     2 TRUE      
    #  7       1        1 2012-09    NA TRUE      
    #  8       1        1 2012-12    NA TRUE      
    #  9       1        1 2013-03    NA TRUE      
    # 10       1        1 2013-06    NA FALSE     
    # # ... with 28 more rows
    
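  5. Finally, tidy up (for convenience) by converting every NA in value to 0, which is what the closing mutate(value = coalesce(value, 0)) does.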
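To see what partial = TRUE changes in step 4, here is a small sketch on a made-up vector (not the question's data). The helper name has_any is only for illustration; it is the same anonymous function used in the pipeline.

```r
library(zoo)

# Made-up series: observed, observed, four missing quarters, observed, missing
value <- c(1, 2, NA, NA, NA, NA, 3, NA)

# TRUE if at least one element of the window is non-NA
has_any <- function(z) any(!is.na(z))

# partial = TRUE: leading windows are allowed to be shorter than 4,
# so there is one result per input element
ind_partial <- rollapplyr(value, 4, FUN = has_any, partial = TRUE)

# partial = FALSE (the default): only complete windows of width 4 are evaluated
ind_full <- rollapplyr(value, 4, FUN = has_any)

length(ind_partial)  # 8, same as length(value)
length(ind_full)     # 5, i.e. length(value) - (4 - 1)
ind_partial
```

The only FALSE in ind_partial is at position 6, where the element itself and the three before it are all NA. Inside mutate() the shorter result from partial = FALSE would not fit the column length, which is why the pipeline needs partial = TRUE.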


data.table

library(data.table)
DT <- as.data.table(df)
year_qs <- unique(  # data.table's own method, so this block does not depend on dplyr::distinct
  DT[, c("miny", "maxy") := .(min(year_q), max(year_q)), by = .(fund_id)
     ][, .(fund_id, stock_id, miny, maxy)]
)[, c("miny","maxy") := lapply(.SD, function(z) as.Date(paste0(z, "-01"))), .SDcols = c("miny","maxy")
  ][, .(year_q = Map(function(a, b) format(seq.Date(a, b, by = "3 months"), format = "%Y-%m"), miny, maxy)),
    by = .(fund_id, stock_id)
    ][, tidyr::unnest(.SD, year_q) ]
setDT(year_qs)
DT[, c("miny", "maxy") := NULL]

DT <- rbindlist(list(
  DT,
  year_qs[!DT, on = .(fund_id, stock_id, year_q)]),
  use.names = TRUE, fill = TRUE)
setorder(DT, fund_id, stock_id, year_q)

DT[, Indicator := zoo::rollapplyr(value, 4, FUN = function(z) any(!is.na(z)), partial = TRUE),
   by = .(fund_id, stock_id)
   ][, value := fcoalesce(value, 0)]
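
If you would rather not pull in tidyr or zoo on the data.table side, here is a rough sketch of a variant that builds the quarter grid with a grouped seq() and swaps the rolling window for shift() plus Reduce(). It is not the code above, just an alternative under the same assumptions (quarters always fall on the same 3-month grid, and year_q sorts correctly as text). The names rng, grid, filled, and present are made up for the illustration.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  fund_id  = c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2),
  stock_id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,3,3,3,3),
  year_q   = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06",
               "2011-12","2012-03","2012-06","2012-09","2012-12","2013-03",
               "2013-06","2014-09","2015-03","2013-03","2013-06","2013-09",
               "2013-12"),
  value    = c(1,2,1,3,4,2,1,2,3,4,2,1,3,1,1,3,2,3,1)
)

# First and last quarter per fund
rng <- DT[, .(miny = min(year_q), maxy = max(year_q)), by = fund_id]

# One row per fund/stock/quarter across the fund's whole range
grid <- unique(DT[, .(fund_id, stock_id)])[rng, on = "fund_id"][
  , .(year_q = format(seq(as.Date(paste0(miny, "-01")),
                          as.Date(paste0(maxy, "-01")),
                          by = "3 months"), "%Y-%m")),
  by = .(fund_id, stock_id)]

# Keep every grid row; value is NA where the stock was not held
filled <- DT[grid, on = .(fund_id, stock_id, year_q)]
setorder(filled, fund_id, stock_id, year_q)

# TRUE if held now or in any of the previous 3 rows (quarters) of the group
filled[, present := !is.na(value)]
filled[, Indicator := present | Reduce(`|`, shift(present, 1:3, fill = FALSE)),
       by = .(fund_id, stock_id)]

# Same tidy-up as before: missing quarters get a zero value
filled[, value := fcoalesce(value, 0)]
```

On the sample data this gives the same 38 rows and the same Indicator column as the result in the question. The shift() lags only line up with calendar quarters because the grid has no gaps, so the fill-out step is still needed.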