How to add an event identifier as a dummy variable to a panel dataset?

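I have a problem with my panel dataset. It looks like the table below, except that it does not yet contain the variable "Eventidentifierdummy", which I am trying to create.

Is there a way to number the events per symbol? Each event window is defined as, say, -4 to 4, and windows occur several times per symbol. I would like these occurrences to be numbered, as an additional identifier for event and company, so that the variable "Eventidentifierdummy" looks like this: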

Symbol  Date        Close       Time      Event  Window    Eventidentifierdummy
AAPL    09/03/2020  66,542503   16:25:00    NA    NA        NA
AAPL    09/03/2020  71,334999   16:26:00    NA    -4        1
AAPL    09/03/2020  68,857498   16:27:00    NA    -3        1
AAPL    09/03/2020  62,057499   16:28:00    NA    -2        1
AAPL    09/03/2020  69,4925     16:29:00    NA    -1        1
AAPL    09/03/2020  60,552502   16:30:00    1      0        1
AAPL    09/03/2020  63,215      16:31:00    NA     1        1 
AAPL    10/03/2020  61,6675     09:30:00    NA     2        1
AAPL    10/03/2020  61,195      09:31:00    NA     3        1
AAPL    10/03/2020  57,310001   09:32:00    NA     4        1  
AAPL    10/03/2020  56,092499   09:33:00    NA    NA        NA
AAPL    15/03/2020  65,535603   15:45:00    NA    NA        NA
AAPL    15/03/2020  66,357545   15:46:00    NA    NA        NA
AAPL    15/03/2020  62,852345   15:47:00    NA    -4        2
AAPL    15/03/2020  64,057325   15:48:00    NA    -3        2
AAPL    16/03/2020  66,494545   09:30:00    NA    -2        2
AAPL    16/03/2020  63,557967   09:31:00    1     -1        2
AAPL    16/03/2020  64,415454   09:32:00    NA     0        2
AAPL    16/03/2020  62,2357     09:33:00    NA     1        2
AAPL    16/03/2020  64,4576     09:34:00    NA     2        2
AAPL    16/03/2020  59,457579   09:35:00    NA     3        2
AAPL    16/03/2020  58,092470   09:36:00    NA     4        2
VISA    05/03/2020  186,960007  16:26:00    NA    NA        NA
VISA    05/03/2020  184,360001  16:27:00    NA    -4        1
VISA    05/03/2020  171,130005  16:28:00    NA    -3        1
VISA    05/03/2020  182,600006  16:29:00    NA    -2        1
VISA    05/03/2020  172,949997  16:30:00    NA    -1        1
VISA    06/03/2020  160,080002  09:32:00    1      0        1
VISA    06/03/2020  175,830002  09:33:00    NA     1        1   
VISA    06/03/2020  152,009995  09:34:00    NA     2        1
VISA    06/03/2020  157,889999  09:35:00    NA     3        1
VISA    06/03/2020  148,479996  09:36:00    NA     4        1
VISA    06/03/2020  152,25      09:37:00    NA    NA        NA
VISA    06/03/2020  146,830002  09:38:00    NA    NA        NA
VISA    20/03/2020  192,203826  16:12:00    NA    NA        NA
VISA    20/03/2020  193,293752  16:13:00    NA    -4        2
VISA    20/03/2020  192,204726  16:14:00    NA    -3        2
VISA    20/03/2020  192,2396    16:15:00    NA    -2        2
VISA    20/03/2020  194,185620  16:16:00    NA    -1        2
VISA    20/03/2020  196,614289  16:17:00    1      0        2
VISA    20/03/2020  197,826200  16:18:00    NA     1        2
VISA    21/03/2020  197,49176   09:29:00    NA     2        2
VISA    21/03/2020  197,239230  09:30:00    NA     3        2
VISA    21/03/2020  198,2300    09:31:00    NA     4        2
VISA    21/03/2020  198,230028  09:32:00    NA    NA        NA
VISA    21/03/2020  197,247020  09:33:00    NA    NA        NA

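Unfortunately, I am not sure how to approach this.

The number of events also differs per symbol: sometimes there are two events, sometimes three. In the example above there are two events for the symbol "AAPL". The sample shows only two symbols, but my full dataset has more than 1000 different symbols, each with its own events and associated event windows.

I tried to solve it with the command "count", but that does not work: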

Companies$Window <- as.numeric(Companies$Window)
setDT(Companies)[,Window:= count(Window)][]

Can you help me create the dummy variable "Eventidentifierdummy"?

Thanks in advance!

PS: Here is the dput() of my data:

> dput(Eventdummy)
structure(list(V1 = c("Symbol", "AAPL", "AAPL", "AAPL", "AAPL", 
"AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", 
"AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", 
"AAPL", "AAPL", "VISA", "VISA", "VISA", "VISA", "VISA", "VISA", 
"VISA", "VISA", "VISA", "VISA", "VISA", "VISA", "VISA", "VISA", 
"VISA", "VISA", "VISA", "VISA", "VISA", "VISA", "VISA", "VISA", 
"VISA", "VISA"), V2 = c("Date", "09/03/2020", "09/03/2020", "09/03/2020", 
"09/03/2020", "09/03/2020", "09/03/2020", "09/03/2020", "10/03/2020", 
"10/03/2020", "10/03/2020", "10/03/2020", "15/03/2020", "15/03/2020", 
"15/03/2020", "15/03/2020", "16/03/2020", "16/03/2020", "16/03/2020", 
"16/03/2020", "16/03/2020", "16/03/2020", "16/03/2020", "05/03/2020", 
"05/03/2020", "05/03/2020", "05/03/2020", "05/03/2020", "06/03/2020", 
"06/03/2020", "06/03/2020", "06/03/2020", "06/03/2020", "06/03/2020", 
"06/03/2020", "20/03/2020", "20/03/2020", "20/03/2020", "20/03/2020", 
"20/03/2020", "20/03/2020", "20/03/2020", "21/03/2020", "21/03/2020", 
"21/03/2020", "21/03/2020", "21/03/2020"), V3 = c("Close", "66,542503", 
"71,334999", "68,857498", "62,057499", "69,4925", "60,552502", 
"63,215", "61,6675", "61,195", "57,310001", "56,092499", "65,535603", 
"66,357545", "62,852345", "64,057325", "66,494545", "63,557967", 
"64,415454", "62,2357", "64,4576", "59,457579", "58,092470", 
"186,960007", "184,360001", "171,130005", "182,600006", "172,949997", 
"160,080002", "175,830002", "152,009995", "157,889999", "148,479996", 
"152,25", "146,830002", "192,203826", "193,293752", "192,204726", 
"192,2396", "194,185620", "196,614289", "197,826200", "197,49176", 
"197,239230", "198,2300", "198,230028", "197,247020"), V4 = c("Time", 
"16:25:00", "16:26:00", "16:27:00", "16:28:00", "16:29:00", "16:30:00", 
"16:31:00", "09:30:00", "09:31:00", "09:32:00", "09:33:00", "15:45:00", 
"15:46:00", "15:47:00", "15:48:00", "09:30:00", "09:31:00", "09:32:00", 
"09:33:00", "09:34:00", "09:35:00", "09:36:00", "16:26:00", "16:27:00", 
"16:28:00", "16:29:00", "16:30:00", "09:32:00", "09:33:00", "09:34:00", 
"09:35:00", "09:36:00", "09:37:00", "09:38:00", "16:12:00", "16:13:00", 
"16:14:00", "16:15:00", "16:16:00", "16:17:00", "16:18:00", "09:29:00", 
"09:30:00", "09:31:00", "09:32:00", "09:33:00"), V5 = c("Event", 
NA, NA, NA, NA, NA, "1", NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, "1", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "1", NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, "1", NA, NA, NA, NA, NA, 
NA), V6 = c("Window", NA, "-4", "-3", "-2", "-1", "0", "1", "2", 
"3", "4", NA, NA, NA, "-4", "-3", "-2", "-1", "0", "1", "2", 
"3", "4", NA, "-4", "-3", "-2", "-1", "0", "1", "2", "3", "4", 
NA, NA, NA, "-4", "-3", "-2", "-1", "0", "1", "2", "3", "4", 
NA, NA), V7 = c("Eventidentifierdummy", NA, "1", "1", "1", "1", 
"1", "1", "1", "1", "1", NA, NA, NA, "2", "2", "2", "2", "2", 
"2", "2", "2", "2", NA, "1", "1", "1", "1", "1", "1", "1", "1", 
"1", NA, NA, NA, "2", "2", "2", "2", "2", "2", "2", "2", "2", 
NA, NA)), class = "data.frame", row.names = c(NA, -47L))
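Note that the dput() above keeps the column names in the first row and stores every column as character (V1 to V7), while the answers below refer to Symbol and Window by name. A minimal sketch of a clean-up step in base R follows; the helper name promote_header is made up for illustration, and the small data frame only mimics the layout of the dput() with the first two data rows:

```r
# Hypothetical helper: promote the first row to column names and
# convert the columns that hold numbers.
promote_header <- function(raw) {
  out <- raw[-1, , drop = FALSE]                     # drop the header row
  names(out) <- unlist(raw[1, ], use.names = FALSE)  # first row becomes the names
  rownames(out) <- NULL
  out$Window <- as.numeric(out$Window)               # "-4" -> -4, NA stays NA
  # Close uses a decimal comma, so swap it for a point before converting
  out$Close <- as.numeric(sub(",", ".", out$Close, fixed = TRUE))
  out
}

# Small stand-in with the same layout as the dput() (three columns only)
raw <- data.frame(
  V1 = c("Symbol", "AAPL", "AAPL"),
  V2 = c("Close", "66,542503", "71,334999"),
  V3 = c("Window", NA, "-4"),
  stringsAsFactors = FALSE
)
demo <- promote_header(raw)
str(demo)
```

With the real data this would presumably be called as df <- promote_header(Eventdummy), assuming the object name from the dput() call.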

You can use rle to create such a sequence and apply it to each Symbol:

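library(dplyr)

# Number the runs of non-NA values in x; NA rows stay NA
create_num <- function(x) {
  # count the non-NA runs, so numbering starts at 1 even if x does not begin with NA
  val <- with(rle(is.na(x)), rep(cumsum(!values), lengths))
  val[is.na(x)] <- NA
  val
}

df %>%
  group_by(Symbol) %>%
  mutate(result = create_num(Window)) %>%
  ungroup()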

This returns:

   Symbol       Date      Close     Time Event Window Eventidentifierdummy result
1    AAPL 09/03/2020  66,542503 16:25:00    NA     NA                   NA     NA
2    AAPL 09/03/2020  71,334999 16:26:00    NA     -4                    1      1
3    AAPL 09/03/2020  68,857498 16:27:00    NA     -3                    1      1
4    AAPL 09/03/2020  62,057499 16:28:00    NA     -2                    1      1
5    AAPL 09/03/2020    69,4925 16:29:00    NA     -1                    1      1
6    AAPL 09/03/2020  60,552502 16:30:00     1      0                    1      1
7    AAPL 09/03/2020     63,215 16:31:00    NA      1                    1      1
8    AAPL 10/03/2020    61,6675 09:30:00    NA      2                    1      1
9    AAPL 10/03/2020     61,195 09:31:00    NA      3                    1      1
10   AAPL 10/03/2020  57,310001 09:32:00    NA      4                    1      1
11   AAPL 10/03/2020  56,092499 09:33:00    NA     NA                   NA     NA
12   AAPL 15/03/2020  65,535603 15:45:00    NA     NA                   NA     NA
13   AAPL 15/03/2020  66,357545 15:46:00    NA     NA                   NA     NA
14   AAPL 15/03/2020  62,852345 15:47:00    NA     -4                    2      2
15   AAPL 15/03/2020  64,057325 15:48:00    NA     -3                    2      2
16   AAPL 16/03/2020  66,494545 09:30:00    NA     -2                    2      2
17   AAPL 16/03/2020  63,557967 09:31:00     1     -1                    2      2
18   AAPL 16/03/2020  64,415454 09:32:00    NA      0                    2      2
19   AAPL 16/03/2020    62,2357 09:33:00    NA      1                    2      2
20   AAPL 16/03/2020    64,4576 09:34:00    NA      2                    2      2
21   AAPL 16/03/2020  59,457579 09:35:00    NA      3                    2      2
22   AAPL 16/03/2020  58,092470 09:36:00    NA      4                    2      2
23   VISA 05/03/2020 186,960007 16:26:00    NA     NA                   NA     NA
24   VISA 05/03/2020 184,360001 16:27:00    NA     -4                    1      1
25   VISA 05/03/2020 171,130005 16:28:00    NA     -3                    1      1
26   VISA 05/03/2020 182,600006 16:29:00    NA     -2                    1      1
27   VISA 05/03/2020 172,949997 16:30:00    NA     -1                    1      1
28   VISA 06/03/2020 160,080002 09:32:00     1      0                    1      1
29   VISA 06/03/2020 175,830002 09:33:00    NA      1                    1      1
30   VISA 06/03/2020 152,009995 09:34:00    NA      2                    1      1
31   VISA 06/03/2020 157,889999 09:35:00    NA      3                    1      1
32   VISA 06/03/2020 148,479996 09:36:00    NA      4                    1      1
33   VISA 06/03/2020     152,25 09:37:00    NA     NA                   NA     NA
34   VISA 06/03/2020 146,830002 09:38:00    NA     NA                   NA     NA
35   VISA 20/03/2020 192,203826 16:12:00    NA     NA                   NA     NA
36   VISA 20/03/2020 193,293752 16:13:00    NA     -4                    2      2
37   VISA 20/03/2020 192,204726 16:14:00    NA     -3                    2      2
38   VISA 20/03/2020   192,2396 16:15:00    NA     -2                    2      2
39   VISA 20/03/2020 194,185620 16:16:00    NA     -1                    2      2
40   VISA 20/03/2020 196,614289 16:17:00     1      0                    2      2
41   VISA 20/03/2020 197,826200 16:18:00    NA      1                    2      2
42   VISA 21/03/2020  197,49176 09:29:00    NA      2                    2      2
43   VISA 21/03/2020 197,239230 09:30:00    NA      3                    2      2
44   VISA 21/03/2020   198,2300 09:31:00    NA      4                    2      2
45   VISA 21/03/2020 198,230028 09:32:00    NA     NA                   NA     NA
46   VISA 21/03/2020 197,247020 09:33:00    NA     NA                   NA     NA

A separate column named result is created so that the answer can be compared with the expected values. You can replace result with Eventidentifierdummy.
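To see why the rle() approach works, it may help to run it step by step on a small vector. The sketch below uses a made-up Window vector for a single symbol. It counts the non-NA runs (rather than the NA runs), which keeps the numbering starting at 1 even when a symbol's first row already lies inside a window:

```r
# Made-up Window values for one symbol: two event windows separated by NAs
x <- c(NA, -1, 0, 1, NA, NA, -1, 0, 1)

r <- rle(is.na(x))
r$values    # TRUE marks a run of NAs, FALSE a run inside an event window
r$lengths   # length of each run

# Count the non-NA runs and expand the counter back to one value per row
id <- rep(cumsum(!r$values), r$lengths)
id[is.na(x)] <- NA   # rows outside any window get no identifier
id
```

One caveat of any run-based numbering: two windows that follow each other with no NA row in between form a single run and would receive the same number.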

Using data.table, group by Symbol and treat every row where Window equals -4 as the start of a new event:

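library(data.table)
setDT(companies)  # convert the data frame to a data.table by reference

# Rows with NA in Window are skipped and keep NA in the new column;
# within each Symbol the counter increases at every Window == -4 row
companies[!is.na(Window),
          Eventidentifierdummy := cumsum(Window == -4),
          by = Symbol]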
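The same idea can be checked without any package. The sketch below is a base R variant using ave(), not the data.table code itself, and it runs on a made-up data frame with two symbols and shortened windows. Like the data.table answer, it assumes that every event window contains a row where Window is exactly -4:

```r
# Made-up data: two symbols, windows shortened to two rows each
toy <- data.frame(
  Symbol = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  Window = c(NA, -4, -3, NA, -4, -3, -4, -3, NA)
)

# TRUE on the first row of every window, FALSE elsewhere (also for NA rows)
is_start <- !is.na(toy$Window) & toy$Window == -4

# Running count of window starts, restarted for every Symbol
toy$Eventidentifierdummy <- ave(as.integer(is_start), toy$Symbol, FUN = cumsum)

# Rows outside any window get no identifier
toy$Eventidentifierdummy[is.na(toy$Window)] <- NA
toy
```

If a window is truncated so that its -4 row is missing, the counter does not advance, and those rows would share the number of the previous window (or get 0 if it is the symbol's first window).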