How to join between different elements of two PySpark dataframes

Question

I have two dataframes, df1 and df2, with the following contents.

df1:

line_item_usage_account_id  line_item_unblended_cost    name 
100000000001                12.05                       account1
200000000001                52                          account2
300000000003                12.03                       account3

df2:

accountname     accountproviderid   clustername     app_pmo     app_costcenter      line_item_unblended_cost
account1        100000000001        cluster1        111111      11111111            12.05
account2        200000000001        cluster2        222222      22222222            52

I need to join them so that the IDs in df1.line_item_usage_account_id that do not appear in df2.accountproviderid are also included in the result, like this:

accountname     accountproviderid   clustername     app_pmo     app_costcenter      line_item_unblended_cost
account1        100000000001        cluster1        111111      11111111            12.05
account2        200000000001        cluster2        222222      22222222            52
account3        300000000003        NA              NA          NA                  12.03

The ID "300000000003" from df1.line_item_usage_account_id is not found in df2.accountproviderid, so it is added to the new dataframe as an extra row.

Any idea how to achieve this? Thanks for your help.

Answer 1

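You can use a right join here. With df1 on the right side, every df1 row is kept, and the df2 columns are null wherever no matching accountproviderid exists: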

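# Both dataframes in the question carry line_item_unblended_cost, which would make
# the final select ambiguous. Drop df2's copy first (a no-op if df2 lacks it).
df2_no_cost = df2.drop("line_item_unblended_cost")

(df2_no_cost
    .join(df1, df2_no_cost.accountproviderid == df1.line_item_usage_account_id, "right")
    .drop("accountname", "accountproviderid")  # keep df1's name and ID instead
    .withColumnRenamed("line_item_usage_account_id", "accountproviderid")
    .withColumnRenamed("name", "accountname")
    .select("accountname", "accountproviderid", "clustername", "app_pmo",
            "app_costcenter", "line_item_unblended_cost")
    .show())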

+-----------+-----------------+-----------+-------+--------------+------------------------+
|accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
+-----------+-----------------+-----------+-------+--------------+------------------------+
|   account1|     100000000001|   cluster1| 111111|      11111111|                   12.05|
|   account2|     200000000001|   cluster2| 222222|      22222222|                    52.0|
|   account3|     300000000003|       null|   null|          null|                   12.03|
+-----------+-----------------+-----------+-------+--------------+------------------------+
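To make the right-join semantics concrete without a Spark session, here is a plain-Python sketch of the same logic using lists of dicts. It is only an illustration: the function name `right_join_accounts`, the `DF2_ONLY_COLUMNS` constant, and the `"NA"` placeholder (taken from the expected output in the question) are assumptions, not part of the PySpark API, and the sketch assumes each account ID appears at most once in df2.

```python
# Plain-Python sketch of the right-join logic (no Spark required).
# All names below are hypothetical and chosen for illustration only.

DF2_ONLY_COLUMNS = ("clustername", "app_pmo", "app_costcenter")


def right_join_accounts(df1_rows, df2_rows, missing="NA"):
    """Keep every df1 row; attach df2's columns when the account ID matches."""
    # Index df2 by account ID (assumes IDs are unique within df2).
    df2_by_id = {row["accountproviderid"]: row for row in df2_rows}
    result = []
    for row in df1_rows:
        # An empty dict stands in for "no matching df2 row".
        match = df2_by_id.get(row["line_item_usage_account_id"], {})
        joined = {
            "accountname": row["name"],
            "accountproviderid": row["line_item_usage_account_id"],
        }
        # Columns that exist only in df2 fall back to the placeholder.
        for col in DF2_ONLY_COLUMNS:
            joined[col] = match.get(col, missing)
        # The cost is always taken from df1, as in the expected output.
        joined["line_item_unblended_cost"] = row["line_item_unblended_cost"]
        result.append(joined)
    return result


df1_rows = [
    {"line_item_usage_account_id": "100000000001",
     "line_item_unblended_cost": 12.05, "name": "account1"},
    {"line_item_usage_account_id": "200000000001",
     "line_item_unblended_cost": 52.0, "name": "account2"},
    {"line_item_usage_account_id": "300000000003",
     "line_item_unblended_cost": 12.03, "name": "account3"},
]
df2_rows = [
    {"accountname": "account1", "accountproviderid": "100000000001",
     "clustername": "cluster1", "app_pmo": "111111", "app_costcenter": "11111111"},
    {"accountname": "account2", "accountproviderid": "200000000001",
     "clustername": "cluster2", "app_pmo": "222222", "app_costcenter": "22222222"},
]

joined_rows = right_join_accounts(df1_rows, df2_rows)
for joined_row in joined_rows:
    print(joined_row)
```

In PySpark itself, the nulls produced by the join could be turned into the "NA" shown in the question with `fillna("NA", subset=["clustername", "app_pmo", "app_costcenter"])`, provided those columns are strings; a string fill value is ignored for numeric columns.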

Tags: python, dataframe, pyspark, pyspark-dataframes