Joining multiple data frames based on column values using dplyr
I have three similar data frames, as follows:
df1<-data.frame(Campaign_Name=c("Z019","Z005","Z019","Z005","Z019"),
Sunday_endwk=c("20190106","20190113","20190113","20190106","20190106"),
Actual_Sales=c(12,2,5,10,12.11),
Predictions=c(11.9,2.03,5.1,10.5,11.7),
Version=c("layer_1","layer_1","layer_1","layer_1","layer_1"),
Adj_Rsquared=c(0.85,0.85,0.85,0.85,0.85))
df1
Campaign_Name Sunday_endwk Actual_Sales Predictions Version Adj_Rsquared
1 Z019 20190106 12.00 11.90 layer_1 0.85
2 Z005 20190113 2.00 2.03 layer_1 0.85
3 Z019 20190113 5.00 5.10 layer_1 0.85
4 Z005 20190106 10.00 10.50 layer_1 0.85
5 Z019 20190106 12.11 11.70 layer_1 0.85
Similarly, the other two data frames are:
df2<-data.frame(Campaign_Name=c("Z019","Z019","Z005","Z005"),
Sunday_endwk=c("20190106","20190113","20190106","20190113"),
Actual_Sales=c(12.2,2.2,5.2,10.2),
Predictions=c(11.8,2.05,5.4,10.1),
Version=c("layer_2","layer_2","layer_2","layer_2"),
Adj_Rsquared=c(0.88,0.88,0.88,0.88))
#df2
df3<-data.frame(Campaign_Name=c("Z005","Z019","Z019","Z005","Z019"),
Sunday_endwk=c("20190106","20190106","20190120","20190113","20190113"),
Actual_Sales=c(12,2,5,10,12),
Predictions=c(11.9,2.03,5.1,10.5,12.3),
Version=c("layer_3","layer_3","layer_3","layer_3","layer_3"),
Adj_Rsquared=c(0.82,0.82,0.82,0.82,0.82))
#df3
## expected output
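I am trying to merge all three data frames on the combination of Campaign_Name + Sunday_endwk (both values must match across the three data frames) and convert the result to wide format, as shown below: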
Campaign_Name Sunday_endwk Actual_Sales_layer_1 Predictions_layer_1 Adj_Rsquared_layer_1 Actual_Sales_layer_2
1 Z019 20190106 12 11.90 0.85 12.2
2 Z005 20190113 2 2.03 0.85 10.2
3 Z019 20190113 5 5.10 0.85 2.2
4 Z005 20190106 10 10.50 0.85 5.2
Predictions_layer_2 Adj_Rsquared_layer_2 Actual_Sales_layer_3 Predictions_layer_3 Adj_Rsquared_layer_3
1 11.80 0.88 2 2.03 0.82
2 10.10 0.88 10 10.50 0.82
3 2.05 0.88 12 12.30 0.82
4 5.40 0.88 12 11.90 0.82
If a Campaign_Name + Sunday_endwk combination is missing from any of the data frames, then that row can either:
- be omitted, or
- be kept, with NA in the other columns
Also, within a data frame, the Campaign_Name + Sunday_endwk combination may not be unique.
Any help would be appreciated.
Thanks.
library(tidyverse)
bind_rows(df1, df2, df3, .id = "week") %>%
  rowid_to_column() %>% # Added for nonunique combos of Camp/Sunday_endwk
  # id_cols is passed by name; recent tidyr versions do not accept it positionally
  pivot_wider(id_cols = c(Campaign_Name, Sunday_endwk, rowid),
              names_from = week, values_from = Actual_Sales:Adj_Rsquared)
Result:
# A tibble: 5 x 14
Campaign_Name Sunday_endwk Actual_Sales_1 Actual_Sales_2 Actual_Sales_3 Predictions_1 Predictions_2 Predictions_3 Version_1 Version_2 Version_3 Adj_Rsquared_1 Adj_Rsquared_2 Adj_Rsquared_3
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Z019 20190106 12 12.2 2 11.9 11.8 2.03 layer_1 layer_2 layer_3 0.85 0.88 0.82
2 Z005 20190113 2 10.2 10 2.03 10.1 10.5 layer_1 layer_2 layer_3 0.85 0.88 0.82
3 Z019 20190113 5 2.2 12 5.1 2.05 12.3 layer_1 layer_2 layer_3 0.85 0.88 0.82
4 Z005 20190106 10 5.2 12 10.5 5.4 11.9 layer_1 layer_2 layer_3 0.85 0.88 0.82
5 Z019 20190120 NA NA 5 NA NA 5.1 NA NA layer_3 NA NA 0.82
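A possible variation on the pivot approach, sketched under a few assumptions rather than offered as the definitive answer. Since rowid_to_column() gives every stacked row its own id, keying on that id keeps rows from different data frames on separate output rows. The sketch below instead numbers repeated Campaign_Name + Sunday_endwk combinations within each Version, so the n-th occurrence in one data frame lines up with the n-th occurrence in the others. It also takes the column suffixes from Version, which matches the naming in the expected output. The names dup_id, wide, and complete_only are hypothetical, and matching duplicates by order of appearance is an assumption about what is wanted.

```r
library(dplyr)
library(tidyr)

# Data frames copied from the question (constant columns are recycled)
df1 <- data.frame(
  Campaign_Name = c("Z019", "Z005", "Z019", "Z005", "Z019"),
  Sunday_endwk  = c("20190106", "20190113", "20190113", "20190106", "20190106"),
  Actual_Sales  = c(12, 2, 5, 10, 12.11),
  Predictions   = c(11.9, 2.03, 5.1, 10.5, 11.7),
  Version       = "layer_1",
  Adj_Rsquared  = 0.85
)
df2 <- data.frame(
  Campaign_Name = c("Z019", "Z019", "Z005", "Z005"),
  Sunday_endwk  = c("20190106", "20190113", "20190106", "20190113"),
  Actual_Sales  = c(12.2, 2.2, 5.2, 10.2),
  Predictions   = c(11.8, 2.05, 5.4, 10.1),
  Version       = "layer_2",
  Adj_Rsquared  = 0.88
)
df3 <- data.frame(
  Campaign_Name = c("Z005", "Z019", "Z019", "Z005", "Z019"),
  Sunday_endwk  = c("20190106", "20190106", "20190120", "20190113", "20190113"),
  Actual_Sales  = c(12, 2, 5, 10, 12),
  Predictions   = c(11.9, 2.03, 5.1, 10.5, 12.3),
  Version       = "layer_3",
  Adj_Rsquared  = 0.82
)

wide <- bind_rows(df1, df2, df3) %>%
  # dup_id (hypothetical name): 1 for the first occurrence of a key
  # within a Version, 2 for the second, and so on
  group_by(Version, Campaign_Name, Sunday_endwk) %>%
  mutate(dup_id = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols     = c(Campaign_Name, Sunday_endwk, dup_id),
    names_from  = Version,
    values_from = c(Actual_Sales, Predictions, Adj_Rsquared)
  )

# Option "keep with NA": `wide` already holds NA where a layer has no match.
# Option "omit": keep only rows present in all three layers.
complete_only <- drop_na(wide)
```

Here wide keeps every combination and fills the missing layers with NA, while complete_only corresponds to the "can be omitted" option. The dup_id column can be dropped afterwards with select(-dup_id) if it is not needed.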