为 R 中数据框中的每一列创建接下来的 4 个主要特征
Creating next 4 lead features for every column in a data frame in R
我有一个非常大的数据框,其中包含 100 多个国家/地区的多个读数(多列),每个读数 98 天。
真实数据框中有几个国家、几个特征和几个日期。我创建了一个示例来说明我的挑战。日期也按所示顺序排列。
样本数据集
#Create sample dataframe
df <- data.frame(Date = seq(ymd("2020-01-01",tz= Sys.timezone()),ymd("2020-01-05",tz= Sys.timezone()),86400),
Country=c(rep("USA",5),rep("INDIA",5),rep("POLAND",5),rep("SWITZERLAND",5)),
var1= seq(1,20,1),var2= seq(1,20,1),var3=seq(1,20,1),val_n = seq(1,20,1))
df
Date Country var1 var2 var3 val_n
1 2020-01-01 USA 1 1 1 1
2 2020-01-02 USA 2 2 2 2
3 2020-01-03 USA 3 3 3 3
4 2020-01-04 USA 4 4 4 4
5 2020-01-05 USA 5 5 5 5
6 2020-01-01 INDIA 6 6 6 6
7 2020-01-02 INDIA 7 7 7 7
8 2020-01-03 INDIA 8 8 8 8
9 2020-01-04 INDIA 9 9 9 9
10 2020-01-05 INDIA 10 10 10 10
11 2020-01-01 POLAND 11 11 11 11
12 2020-01-02 POLAND 12 12 12 12
13 2020-01-03 POLAND 13 13 13 13
14 2020-01-04 POLAND 14 14 14 14
15 2020-01-05 POLAND 15 15 15 15
16 2020-01-01 SWITZERLAND 16 16 16 16
17 2020-01-02 SWITZERLAND 17 17 17 17
18 2020-01-03 SWITZERLAND 18 18 18 18
19 2020-01-04 SWITZERLAND 19 19 19 19
20 2020-01-05 SWITZERLAND 20 20 20 20
Objective
目标是将每个读数的接下来 4 天的值作为新列,如下所示。如果没有下一个值,则应填充 NA。
预期输出
Date Country var1 var2 var3 var_n var1_next_1day var1_next_2day var1_next_3day var1_next_4day Same_var2_n
1 1-Jan USA 1 1 1 1 2 3 4 5 …
2 2-Jan USA 2 2 2 2 3 4 5 NA …
3 3-Jan USA 3 3 3 3 4 5 NA NA …
4 4-Jan USA 4 4 4 4 5 NA NA NA …
5 5-Jan USA 5 5 5 5 NA NA NA NA …
6 1-Jan INDIA 6 6 6 6 7 8 9 10 …
7 2-Jan INDIA 7 7 7 7 8 9 10 NA …
8 3-Jan INDIA 8 8 8 8 9 10 NA NA …
9 4-Jan INDIA 9 9 9 9 10 NA NA NA …
10 5-Jan INDIA 10 10 10 10 NA NA NA NA …
11 1-Jan POLAND 11 11 11 11 12 13 14 15 …
12 2-Jan POLAND 12 12 12 12 13 14 15 NA …
13 3-Jan POLAND 13 13 13 13 14 15 NA NA …
14 4-Jan POLAND 14 14 14 14 15 NA NA NA …
15 5-Jan POLAND 15 15 15 15 NA NA NA NA …
16 1-Jan SWITZERLAND 16 16 16 16 17 18 19 20 …
17 2-Jan SWITZERLAND 17 17 17 17 18 19 20 NA …
18 3-Jan SWITZERLAND 18 18 18 18 19 20 NA NA …
19 4-Jan SWITZERLAND 19 19 19 19 20 NA NA NA …
20 5-Jan SWITZERLAND 20 20 20 20 NA NA NA NA …
在dplyr
中,我们可以使用lead
和mutate_at
来将它应用于多列。
library(dplyr)
df %>%
group_by(Country) %>%
mutate_at(vars(var1:val_n), list(next_1day = ~lead(.), next_2day = ~lead(., 2),
next_3day = ~lead(., 3), next_4day = ~lead(., 4)))
library(data.table)
df <- data.table(df)
#one day
df[,var1_next_1day:=shift(var1,-1),by="Country"]
#second day
df[,var2_next_2day:=shift(var2,-2),by="Country"]
我有一个非常大的数据框,其中包含 100 多个国家/地区的多个读数(多列),每个读数 98 天。
真实数据框中有几个国家、几个特征和几个日期。我创建了一个示例来说明我的挑战。日期也按所示顺序排列。
样本数据集
#Create sample dataframe
df <- data.frame(Date = seq(ymd("2020-01-01",tz= Sys.timezone()),ymd("2020-01-05",tz= Sys.timezone()),86400),
Country=c(rep("USA",5),rep("INDIA",5),rep("POLAND",5),rep("SWITZERLAND",5)),
var1= seq(1,20,1),var2= seq(1,20,1),var3=seq(1,20,1),val_n = seq(1,20,1))
df
Date Country var1 var2 var3 val_n
1 2020-01-01 USA 1 1 1 1
2 2020-01-02 USA 2 2 2 2
3 2020-01-03 USA 3 3 3 3
4 2020-01-04 USA 4 4 4 4
5 2020-01-05 USA 5 5 5 5
6 2020-01-01 INDIA 6 6 6 6
7 2020-01-02 INDIA 7 7 7 7
8 2020-01-03 INDIA 8 8 8 8
9 2020-01-04 INDIA 9 9 9 9
10 2020-01-05 INDIA 10 10 10 10
11 2020-01-01 POLAND 11 11 11 11
12 2020-01-02 POLAND 12 12 12 12
13 2020-01-03 POLAND 13 13 13 13
14 2020-01-04 POLAND 14 14 14 14
15 2020-01-05 POLAND 15 15 15 15
16 2020-01-01 SWITZERLAND 16 16 16 16
17 2020-01-02 SWITZERLAND 17 17 17 17
18 2020-01-03 SWITZERLAND 18 18 18 18
19 2020-01-04 SWITZERLAND 19 19 19 19
20 2020-01-05 SWITZERLAND 20 20 20 20
Objective 目标是将每个读数的接下来 4 天的值作为新列,如下所示。如果没有下一个值,则应填充 NA。
预期输出
Date Country var1 var2 var3 var_n var1_next_1day var1_next_2day var1_next_3day var1_next_4day Same_var2_n
1 1-Jan USA 1 1 1 1 2 3 4 5 …
2 2-Jan USA 2 2 2 2 3 4 5 NA …
3 3-Jan USA 3 3 3 3 4 5 NA NA …
4 4-Jan USA 4 4 4 4 5 NA NA NA …
5 5-Jan USA 5 5 5 5 NA NA NA NA …
6 1-Jan INDIA 6 6 6 6 7 8 9 10 …
7 2-Jan INDIA 7 7 7 7 8 9 10 NA …
8 3-Jan INDIA 8 8 8 8 9 10 NA NA …
9 4-Jan INDIA 9 9 9 9 10 NA NA NA …
10 5-Jan INDIA 10 10 10 10 NA NA NA NA …
11 1-Jan POLAND 11 11 11 11 12 13 14 15 …
12 2-Jan POLAND 12 12 12 12 13 14 15 NA …
13 3-Jan POLAND 13 13 13 13 14 15 NA NA …
14 4-Jan POLAND 14 14 14 14 15 NA NA NA …
15 5-Jan POLAND 15 15 15 15 NA NA NA NA …
16 1-Jan SWITZERLAND 16 16 16 16 17 18 19 20 …
17 2-Jan SWITZERLAND 17 17 17 17 18 19 20 NA …
18 3-Jan SWITZERLAND 18 18 18 18 19 20 NA NA …
19 4-Jan SWITZERLAND 19 19 19 19 20 NA NA NA …
20 5-Jan SWITZERLAND 20 20 20 20 NA NA NA NA …
在dplyr
中,我们可以使用lead
和mutate_at
来将它应用于多列。
library(dplyr)
df %>%
group_by(Country) %>%
mutate_at(vars(var1:val_n), list(next_1day = ~lead(.), next_2day = ~lead(., 2),
next_3day = ~lead(., 3), next_4day = ~lead(., 4)))
library(data.table)
df <- data.table(df)
#one day
df[,var1_next_1day:=shift(var1,-1),by="Country"]
#second day
df[,var2_next_2day:=shift(var2,-2),by="Country"]