How to join two PySpark DataFrames on different columns
I have two DataFrames, df1 and df2, with the following contents.
df1:
line_item_usage_account_id  line_item_unblended_cost  name
100000000001                12.05                     account1
200000000001                52                        account2
300000000003                12.03                     account3
df2:
accountname  accountproviderid  clustername  app_pmo  app_costcenter  line_item_unblended_cost
account1     100000000001       cluster1     111111   11111111        12.05
account2     200000000001       cluster2     222222   22222222        52
I need the join to also include the IDs from df1.line_item_usage_account_id that are not in df2.accountproviderid, like this:
accountname  accountproviderid  clustername  app_pmo  app_costcenter  line_item_unblended_cost
account1     100000000001       cluster1     111111   11111111        12.05
account2     200000000001       cluster2     222222   22222222        52
account3     300000000003       NA           NA       NA              12.03
The ID "300000000003" from df1.line_item_usage_account_id is not found in df2.accountproviderid, so it is added to the new DataFrame.
Any idea how to achieve this? Thanks for your help.
You can use a right join here:
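# A right join keeps every row of df1, including IDs that are missing from df2.
df2.join(df1, df2.accountproviderid == df1.line_item_usage_account_id, "right")\
.drop(df2["line_item_unblended_cost"])\
.drop("accountname", "accountproviderid")\
.withColumnRenamed("line_item_usage_account_id", "accountproviderid")\
.withColumnRenamed("name", "accountname")\
.select("accountname", "accountproviderid", "clustername", "app_pmo",
"app_costcenter", "line_item_unblended_cost").show()
# df2's copy of line_item_unblended_cost is dropped first; with both copies
# present, selecting that column by name would be ambiguous.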
+-----------+-----------------+-----------+-------+--------------+------------------------+
|accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
+-----------+-----------------+-----------+-------+--------------+------------------------+
|   account1|     100000000001|   cluster1| 111111|      11111111|                   12.05|
|   account2|     200000000001|   cluster2| 222222|      22222222|                    52.0|
|   account3|     300000000003|       null|   null|          null|                   12.03|
+-----------+-----------------+-----------+-------+--------------+------------------------+