How to generate a window around a variable with multiple events per Symbol?

I have a problem: my panel data set looks like the data below, just without the variable "Window". Now I am trying to create the variable "Window" as shown here:

Symbol  Date        Close       Time      Event  Window
AAPL    09/03/2020  66,542503   16:25:00    NA    NA
AAPL    09/03/2020  71,334999   16:26:00    NA    -4
AAPL    09/03/2020  68,857498   16:27:00    NA    -3
AAPL    09/03/2020  62,057499   16:28:00    NA    -2
AAPL    09/03/2020  69,4925     16:29:00    NA    -1
AAPL    09/03/2020  60,552502   16:30:00    1      0
AAPL    09/03/2020  63,215      16:31:00    NA     1 
AAPL    10/03/2020  61,6675     09:30:00    NA     2 
AAPL    10/03/2020  61,195      09:31:00    NA     3 
AAPL    10/03/2020  57,310001   09:32:00    NA     4  
AAPL    10/03/2020  56,092499   09:33:00    NA    NA 
AAPL    15/03/2020  65,535603   15:45:00    NA    NA
AAPL    15/03/2020  66,357545   15:46:00    NA    -4
AAPL    15/03/2020  62,852345   15:47:00    NA    -3
AAPL    15/03/2020  64,057325   15:48:00    NA    -2
AAPL    16/03/2020  66,494545   09:30:00    NA    -1
AAPL    16/03/2020  63,557967   09:31:00    1      0
AAPL    16/03/2020  64,415454   09:32:00    NA     1
AAPL    16/03/2020  62,2357     09:33:00    NA     2
AAPL    16/03/2020  64,4576     09:34:00    NA     3
AAPL    16/03/2020  59,457579   09:35:00    NA     4
AAPL    16/03/2020  58,092470   09:36:00    NA    NA
VISA    05/03/2020  186,960007  16:26:00    NA    NA 
VISA    05/03/2020  184,360001  16:27:00    NA    -4 
VISA    05/03/2020  171,130005  16:28:00    NA    -3 
VISA    05/03/2020  182,600006  16:29:00    NA    -2 
VISA    05/03/2020  172,949997  16:30:00    NA    -1 
VISA    06/03/2020  160,080002  09:32:00    1      0
VISA    06/03/2020  175,830002  09:33:00    NA     1 
VISA    06/03/2020  152,009995  09:34:00    NA     2 
VISA    06/03/2020  157,889999  09:35:00    NA     3 
VISA    06/03/2020  148,479996  09:36:00    NA     4 
VISA    06/03/2020  152,25      09:37:00    NA    NA 
VISA    06/03/2020  146,830002  09:38:00    NA    NA 
VISA    20/03/2020  192,203826  16:12:00    NA    NA 
VISA    20/03/2020  193,293752  16:13:00    NA    -4 
VISA    20/03/2020  192,204726  16:14:00    NA    -3 
VISA    20/03/2020  192,2396    16:15:00    NA    -2 
VISA    20/03/2020  194,185620  16:16:00    NA    -1 
VISA    20/03/2020  196,614289  16:17:00    1      0
VISA    20/03/2020  197,826200  16:18:00    NA     1 
VISA    21/03/2020  197,49176   09:29:00    NA     2 
VISA    21/03/2020  197,239230  09:30:00    NA     3 
VISA    21/03/2020  198,2300    09:31:00    NA     4 
VISA    21/03/2020  198,230028  09:32:00    NA    NA 
VISA    21/03/2020  197,247020  09:33:00    NA    NA 

I have already tried the following code that I found:

EventStudy <- EventStudy %>%
  group_by(Symbol) %>%
  mutate(Window = row_number() - match(1, Event),
         Window = ifelse(abs(Window) > 4, NA, Window)) %>%
  ungroup()

Unfortunately, this gives me only one window per symbol, but my data contains multiple events per symbol. For example, for the symbol "AAPL" I have two events.

I also tried the code without group_by, but that did not work as expected either. And I don't have a suitable grouping of the data set that would leave only one event per group.

Is there a way to modify the code so that it works with multiple events per symbol? Can you help me create the variable "Window"?

Thank you very much!

This is one of the rare cases where I would not use tidyverse style. I would use a small for loop over the lags (only 9 iterations):

## test data
event <- c(NA, NA, NA, NA, NA, 1 , NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1 , NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1 , NA, NA, NA, NA, NA)


event_pos <- which(event == 1)     # row positions of the events
window <- rep(NA, length(event))

for (dif in -4:4) {
  window[event_pos + dif] <- dif   # offset relative to each event
}

Note that this code does not handle special cases such as overlapping windows or lags that fall outside the data range, but it can easily be adapted to do so (see the sketch after the result below).

Result:

     event window
        NA     NA
        NA     -4
        NA     -3
        NA     -2
        NA     -1
         1      0
        NA      1
        NA      2
        NA      3
        NA      4
        NA     NA
        NA     NA
        NA     -4
        NA     -3
        NA     -2
        NA     -1
         1      0
        NA      1
        NA      2
        NA      3
        NA      4
        NA     NA
        NA     NA
        NA     -4
        NA     -3
        NA     -2
        NA     -1
         1      0
        NA      1
        NA      2
        NA      3
        NA      4
        NA     NA
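
For instance, here is a minimal sketch (my own adaptation, not part of the original answer) of a bounds check that simply skips window positions falling outside the data, instead of relying on the padding trick used in the grouped version below:

## hypothetical adaptation: drop out-of-range positions before assigning
for (dif in -4:4) {
  idx <- event_pos + dif
  in_range <- idx >= 1 & idx <= length(event)  # keep only valid row indices
  window[idx[in_range]] <- dif
}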

You can easily adjust the code to work per symbol group and to handle the edge cases:

library(tidyverse)


create_window <- function(event) {
  
  event_pos <- which(event == 1)
  
  ## no event for this symbol: return an all-NA window
  if (length(event_pos) == 0) {
    return(rep(NA, length(event)))
  }
  
  ## pad by 4 positions on each side so events near the start/end do not index out of range
  window <- rep(NA, length(event) + 8)
  
  for (dif in -4:4) {
    window[event_pos + dif + 4] <- dif
  }
  
  ## drop the 8 padding positions again
  window <- window[-c(1:4, (length(window):(length(window) - 3)))]
  
  window
  
}

testdata %>% 
  group_by(symbol) %>% 
  mutate(window = create_window(event)) %>% 
  ungroup()
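
This snippet uses lowercase column names (symbol, event, window). Applied to the data frame from the question, the call would look roughly like this (assuming it is stored as EventStudy with the columns Symbol and Event):

EventStudy <- EventStudy %>%
  group_by(Symbol) %>%
  mutate(Window = create_window(Event)) %>%
  ungroup()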

Some timings on my machine (16 GB RAM, i5-6600K) for 200 million rows with 100k symbols and 1.5 million events: it takes roughly 7.2 seconds.

## 200 million rows, randomly distributed over 100k symbols
testdata <-
  tibble(event = rep(NA_real_, 200000000),
         symbol = rep(1:100000, c(rmultinom(1, 200000000, rep(1/100000, 100000)))))

## place 1.5 million events, at least 4 rows away from the start and end of the data
testdata$event[sample.int(length(testdata$event)-9, 1500000)+4] <- 1

microbenchmark::microbenchmark({
  
  testdata %>% 
    group_by(symbol) %>% 
    mutate(window = create_window(event)) %>% 
    ungroup()

  
},
times = 10)

###
##     mean   median       uq      max neval
## 7.650121 7.201488 7.390293 10.21066    10

You can achieve this with the help of a helper function that returns the index of the closest Event = 1 value.

library(dplyr)

closest_index <- function(x, y) {
  y <- which(y == 1)  # positions of the events
  ## for each row index in x, return the position of the nearest event
  y[sapply(x, function(i) which(abs(y - i) == min(abs(y - i)))[1])]
}

EventStudy %>%
  group_by(Symbol) %>%
  mutate(close_index = closest_index(row_number(), Event),  
         Window = row_number() - close_index, 
         Window = ifelse(abs(Window) > 4, NA, Window)) %>%
  ungroup() %>%
  select(-close_index)

This returns:

#   Symbol       Date      Close     Time Event Window
#1    AAPL 09/03/2020  66,542503 16:25:00    NA     NA
#2    AAPL 09/03/2020  71,334999 16:26:00    NA     -4
#3    AAPL 09/03/2020  68,857498 16:27:00    NA     -3
#4    AAPL 09/03/2020  62,057499 16:28:00    NA     -2
#5    AAPL 09/03/2020    69,4925 16:29:00    NA     -1
#6    AAPL 09/03/2020  60,552502 16:30:00     1      0
#7    AAPL 09/03/2020     63,215 16:31:00    NA      1
#8    AAPL 10/03/2020    61,6675 09:30:00    NA      2
#9    AAPL 10/03/2020     61,195 09:31:00    NA      3
#10   AAPL 10/03/2020  57,310001 09:32:00    NA      4
#11   AAPL 10/03/2020  56,092499 09:33:00    NA     NA
#12   AAPL 15/03/2020  65,535603 15:45:00    NA     NA
#13   AAPL 15/03/2020  66,357545 15:46:00    NA     -4
#14   AAPL 15/03/2020  62,852345 15:47:00    NA     -3
#15   AAPL 15/03/2020  64,057325 15:48:00    NA     -2
#16   AAPL 16/03/2020  66,494545 09:30:00    NA     -1
#17   AAPL 16/03/2020  63,557967 09:31:00     1      0
#18   AAPL 16/03/2020  64,415454 09:32:00    NA      1
#19   AAPL 16/03/2020    62,2357 09:33:00    NA      2
#20   AAPL 16/03/2020    64,4576 09:34:00    NA      3
#21   AAPL 16/03/2020  59,457579 09:35:00    NA      4
#22   AAPL 16/03/2020  58,092470 09:36:00    NA     NA
#23   VISA 05/03/2020 186,960007 16:26:00    NA     NA
#24   VISA 05/03/2020 184,360001 16:27:00    NA     -4
#25   VISA 05/03/2020 171,130005 16:28:00    NA     -3
#26   VISA 05/03/2020 182,600006 16:29:00    NA     -2
#27   VISA 05/03/2020 172,949997 16:30:00    NA     -1
#28   VISA 06/03/2020 160,080002 09:32:00     1      0
#29   VISA 06/03/2020 175,830002 09:33:00    NA      1
#30   VISA 06/03/2020 152,009995 09:34:00    NA      2
#31   VISA 06/03/2020 157,889999 09:35:00    NA      3
#32   VISA 06/03/2020 148,479996 09:36:00    NA      4
#33   VISA 06/03/2020     152,25 09:37:00    NA     NA
#34   VISA 06/03/2020 146,830002 09:38:00    NA     NA
#35   VISA 20/03/2020 192,203826 16:12:00    NA     NA
#36   VISA 20/03/2020 193,293752 16:13:00    NA     -4
#37   VISA 20/03/2020 192,204726 16:14:00    NA     -3
#38   VISA 20/03/2020   192,2396 16:15:00    NA     -2
#39   VISA 20/03/2020 194,185620 16:16:00    NA     -1
#40   VISA 20/03/2020 196,614289 16:17:00     1      0
#41   VISA 20/03/2020 197,826200 16:18:00    NA      1
#42   VISA 21/03/2020  197,49176 09:29:00    NA      2
#43   VISA 21/03/2020 197,239230 09:30:00    NA      3
#44   VISA 21/03/2020   198,2300 09:31:00    NA      4
#45   VISA 21/03/2020 198,230028 09:32:00    NA     NA
#46   VISA 21/03/2020 197,247020 09:33:00    NA     NA