Creating next 4 lead features for every column in a data frame in R

I have a very large data frame containing multiple readings (multiple columns) for 100+ countries, with 98 days of data for each reading.

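The real data frame has many countries, many features, and many dates. I created a sample to illustrate my challenge. The dates are ordered as shown.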

Sample dataset

library(lubridate)  # provides ymd()

# Create sample dataframe (the 5 dates are recycled across the 4 countries)
df <- data.frame(Date = seq(ymd("2020-01-01", tz = Sys.timezone()), ymd("2020-01-05", tz = Sys.timezone()), 86400),
                 Country = c(rep("USA", 5), rep("INDIA", 5), rep("POLAND", 5), rep("SWITZERLAND", 5)),
                 var1 = seq(1, 20, 1), var2 = seq(1, 20, 1), var3 = seq(1, 20, 1), val_n = seq(1, 20, 1))

df
         Date     Country var1 var2 var3 val_n
1  2020-01-01         USA    1    1    1     1
2  2020-01-02         USA    2    2    2     2
3  2020-01-03         USA    3    3    3     3
4  2020-01-04         USA    4    4    4     4
5  2020-01-05         USA    5    5    5     5
6  2020-01-01       INDIA    6    6    6     6
7  2020-01-02       INDIA    7    7    7     7
8  2020-01-03       INDIA    8    8    8     8
9  2020-01-04       INDIA    9    9    9     9
10 2020-01-05       INDIA   10   10   10    10
11 2020-01-01      POLAND   11   11   11    11
12 2020-01-02      POLAND   12   12   12    12
13 2020-01-03      POLAND   13   13   13    13
14 2020-01-04      POLAND   14   14   14    14
15 2020-01-05      POLAND   15   15   15    15
16 2020-01-01 SWITZERLAND   16   16   16    16
17 2020-01-02 SWITZERLAND   17   17   17    17
18 2020-01-03 SWITZERLAND   18   18   18    18
19 2020-01-04 SWITZERLAND   19   19   19    19
20 2020-01-05 SWITZERLAND   20   20   20    20

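Objective: for each reading, add the values of the next 4 days as new columns, as shown below. If there is no next value, the cell should be filled with NA.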

Expected output

    Date     Country var1 var2 var3 val_n var1_next_1day var1_next_2day var1_next_3day var1_next_4day Same_var2_n
1  1-Jan         USA    1    1    1     1              2              3              4              5           …
2  2-Jan         USA    2    2    2     2              3              4              5             NA           …
3  3-Jan         USA    3    3    3     3              4              5             NA             NA           …
4  4-Jan         USA    4    4    4     4              5             NA             NA             NA           …
5  5-Jan         USA    5    5    5     5             NA             NA             NA             NA           …
6  1-Jan       INDIA    6    6    6     6              7              8              9             10           …
7  2-Jan       INDIA    7    7    7     7              8              9             10             NA           …
8  3-Jan       INDIA    8    8    8     8              9             10             NA             NA           …
9  4-Jan       INDIA    9    9    9     9             10             NA             NA             NA           …
10 5-Jan       INDIA   10   10   10    10             NA             NA             NA             NA           …
11 1-Jan      POLAND   11   11   11    11             12             13             14             15           …
12 2-Jan      POLAND   12   12   12    12             13             14             15             NA           …
13 3-Jan      POLAND   13   13   13    13             14             15             NA             NA           …
14 4-Jan      POLAND   14   14   14    14             15             NA             NA             NA           …
15 5-Jan      POLAND   15   15   15    15             NA             NA             NA             NA           …
16 1-Jan SWITZERLAND   16   16   16    16             17             18             19             20           …
17 2-Jan SWITZERLAND   17   17   17    17             18             19             20             NA           …
18 3-Jan SWITZERLAND   18   18   18    18             19             20             NA             NA           …
19 4-Jan SWITZERLAND   19   19   19    19             20             NA             NA             NA           …
20 5-Jan SWITZERLAND   20   20   20    20             NA             NA             NA             NA           …

In dplyr, we can use lead together with mutate_at to apply it to multiple columns.

library(dplyr)

df %>%
  group_by(Country) %>%
  mutate_at(vars(var1:val_n), list(next_1day = ~lead(.), next_2day = ~lead(., 2),
                               next_3day = ~lead(., 3), next_4day = ~lead(., 4)))
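
In dplyr 1.0.0 and later, mutate_at is superseded by across(). The following is a minimal sketch of the same idea with across(), not a definitive implementation. It rebuilds the sample data with base R dates so that it runs on its own, and it assumes the rows are already ordered by date within each country, as stated in the question. The helper name lead_fns is hypothetical.

```r
library(dplyr)

# Same toy data as in the question, built with base R dates
df <- data.frame(
  Date    = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-05"), by = "day"), 4),
  Country = rep(c("USA", "INDIA", "POLAND", "SWITZERLAND"), each = 5),
  var1 = 1:20, var2 = 1:20, var3 = 1:20, val_n = 1:20
)

# One named function per horizon: next_1day ... next_4day
# (lead_fns is an illustrative name, not part of dplyr)
lead_fns <- lapply(1:4, function(k) {
  force(k)                           # fix k for this closure
  function(x) dplyr::lead(x, k)
})
names(lead_fns) <- paste0("next_", 1:4, "day")

res <- df %>%
  group_by(Country) %>%
  # the default naming for a list of functions is "{.col}_{.fn}",
  # e.g. var1_next_1day
  mutate(across(var1:val_n, lead_fns)) %>%
  ungroup()
```

Building the list of functions programmatically means that changing the number of horizons only requires changing 1:4 in one place.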
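
Alternatively, with data.table, shift with a negative n returns the lead values within each group.

library(data.table)
df <- data.table(df)
# one day ahead
df[, var1_next_1day := shift(var1, -1), by = "Country"]
# two days ahead
df[, var2_next_2day := shift(var2, -2), by = "Country"]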
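
The two lines above create one column at a time. As a rough sketch, shift also accepts a list of columns and a vector of n, so all four leads for every value column can be created in one call. This example rebuilds the sample data so that it is self-contained; the names dt, value_cols, horizons, and new_names are illustrative, and the rows are assumed to be ordered by date within each country. Here type = "lead" with positive n is used, which is equivalent to the negative n above.

```r
library(data.table)

# Same toy data as in the question, built with base R dates
dt <- data.table(
  Date    = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-05"), by = "day"), 4),
  Country = rep(c("USA", "INDIA", "POLAND", "SWITZERLAND"), each = 5),
  var1 = 1:20, var2 = 1:20, var3 = 1:20, val_n = 1:20
)

value_cols <- c("var1", "var2", "var3", "val_n")
horizons   <- 1:4

# shift() on several columns with a vector n returns its results
# column-first: all horizons for var1, then all horizons for var2, ...
# so the new names are built in that same order.
new_names <- paste0(rep(value_cols, each = length(horizons)),
                    "_next_", horizons, "day")

dt[, (new_names) := shift(.SD, n = horizons, type = "lead"),
   by = Country, .SDcols = value_cols]
```

Because the assignment is by reference, dt gains the 16 new columns in place, with NA wherever a country has no later row.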