R dplyr full_join - no common key, need common columns to blend together
For example, I have these two data frames:
dates <- c('2020-11-19', '2020-11-20', '2020-11-21')
df1 <- data.frame(dates, area = c('paris', 'london', 'newyork'),
                  rating = c(10, 5, 6),
                  rating2 = c(5, 6, 7))
df2 <- data.frame(dates, area = c('budapest', 'moscow', 'valencia'),
                  rating = c(1, 2, 1))
> df1
       dates    area rating rating2
1 2020-11-19   paris     10       5
2 2020-11-20  london      5       6
3 2020-11-21 newyork      6       7
> df2
       dates     area rating
1 2020-11-19 budapest      1
2 2020-11-20   moscow      2
3 2020-11-21 valencia      1
When I perform a full (outer) join with dplyr:
library(dplyr)
df <- df1 %>%
  full_join(df2, by = c('dates', 'area'))
the result looks like this:
       dates     area rating.x rating2 rating.y
1 2020-11-19    paris       10       5       NA
2 2020-11-20   london        5       6       NA
3 2020-11-21  newyork        6       7       NA
4 2020-11-19 budapest       NA      NA        1
5 2020-11-20   moscow       NA      NA        2
6 2020-11-21 valencia       NA      NA        1
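That is, the rating columns from the two data frames are not blended into one; two separate columns (rating.x and rating.y) are created instead.
How can I get a result like this?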
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA
Thanks to @kybazzi for the solution, which gives the desired result:
df <- df1 %>%
  bind_rows(df2)
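Follow-up
As a follow-up question, I would like to add the following data frame to the combined one:
df3 <- data.frame(dates, area = c('budapest', 'moscow', 'valencia'),
                  rating2 = c(3, 2, 5))
With the same approach, the result looks like this: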
> df_final <- df %>%
+   bind_rows(df3)
> df_final
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA
7 2020-11-19 budapest     NA       3
8 2020-11-20   moscow     NA       2
9 2020-11-21 valencia     NA       5
How can I get a result like this instead:
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1       3
5 2020-11-20   moscow      2       2
6 2020-11-21 valencia      1       5
What you are looking for is dplyr::bind_rows(), which keeps the common columns and fills with NA any column that exists in only one of the data frames:
> bind_rows(df1, df2)
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA
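Note that you can also keep using full_join(), but if you don't want columns to be split, you must include every column that the data frames have in common as a key: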
> full_join(
+   df1, df2,
+   by = c("dates", "area", "rating")
+ )
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA
The documentation for dplyr joins mentions:
Output columns include all x columns and all y columns. If columns in x and y have the same name (and aren't included in by), suffixes are added to disambiguate.
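To make the quoted rule concrete, here is a minimal sketch. The data frames src_a and src_b and the suffixes "_a" and "_b" are hypothetical and only serve as an illustration; the sketch assumes two sources that rate the same (dates, area) rows. Because rating is not listed in by, it is kept once per input and suffixed:

```r
library(dplyr)

# Hypothetical data: two sources rating the same (dates, area) rows
src_a <- data.frame(
  dates  = c("2020-11-19", "2020-11-20"),
  area   = c("paris", "london"),
  rating = c(10, 5)
)
src_b <- data.frame(
  dates  = c("2020-11-19", "2020-11-20"),
  area   = c("paris", "london"),
  rating = c(8, 7)
)

# `rating` is not a key, so each input keeps its own copy;
# `suffix` replaces the default ".x" / ".y" endings
joined <- full_join(src_a, src_b, by = c("dates", "area"),
                    suffix = c("_a", "_b"))
joined
```

Here the rows match on the keys, so the two ratings end up side by side in a single row per (dates, area) pair. In the question's data no keys match, which is why the same rule only produces NA-padded columns there.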
You can also avoid this problem by not specifying by at all, in which case dplyr uses all common columns.
> full_join(df1, df2)
Joining, by = c("dates", "area", "rating")
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA
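If you would rather state the keys explicitly than rely on the join message, one possible sketch is to compute the shared column names yourself and pass them as by. The variable name common_cols is made up for this illustration:

```r
library(dplyr)

dates <- c("2020-11-19", "2020-11-20", "2020-11-21")
df1 <- data.frame(dates, area = c("paris", "london", "newyork"),
                  rating = c(10, 5, 6),
                  rating2 = c(5, 6, 7))
df2 <- data.frame(dates, area = c("budapest", "moscow", "valencia"),
                  rating = c(1, 2, 1))

# Columns present in both data frames (the same set dplyr picks when `by` is omitted)
common_cols <- intersect(names(df1), names(df2))

# Explicit version of the natural join; no "Joining, by = ..." message is printed
joined <- full_join(df1, df2, by = common_cols)
joined
```

Keep in mind that this turns rating into a key as well: two rows are merged only if their dates, area and rating all agree. That is harmless for this data because no (dates, area) pair appears in both data frames.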
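As far as I can tell, both approaches work for your use case. In fact, I believe the practical advantage of full_join() over bind_rows() is precisely the behavior you are trying to avoid here, namely splitting columns that are not keys.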
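The follow-up in the question (adding df3 without ending up with nine rows) is not addressed above. One possible sketch, assuming that df2 and df3 always describe the same (dates, area) rows as they do in the question: join df2 and df3 on those two keys first, so rating and rating2 land in the same row, and only then stack the result under df1. The intermediate name df23 is made up for this illustration.

```r
library(dplyr)

dates <- c("2020-11-19", "2020-11-20", "2020-11-21")
df1 <- data.frame(dates, area = c("paris", "london", "newyork"),
                  rating = c(10, 5, 6),
                  rating2 = c(5, 6, 7))
df2 <- data.frame(dates, area = c("budapest", "moscow", "valencia"),
                  rating = c(1, 2, 1))
df3 <- data.frame(dates, area = c("budapest", "moscow", "valencia"),
                  rating2 = c(3, 2, 5))

# df2 and df3 share the keys but carry different value columns,
# so a join on the keys puts `rating` and `rating2` side by side
df23 <- full_join(df2, df3, by = c("dates", "area"))

# df1 and df23 now have the same columns and disjoint rows: stack them
df_final <- bind_rows(df1, df23)
df_final
```

The design choice is to join where rows overlap (df2 and df3) and to bind where they do not (df1 versus the rest). If df2 and df3 shared a value column as well, the suffix behavior described above would apply again.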