How to efficiently join 2 dataframes by comparing their column values

I have 2 dataframes, m_df and s_df:

// m_df Schema

root
 |-- column_A: string (nullable = true)
 |-- column_B: string (nullable = true)
 |-- column_C: string (nullable = true)
 |-- column_D: string (nullable = true)
 |-- column_E: string (nullable = true)
 |-- column_F: string (nullable = true)
 |-- column_G: string (nullable = true)
 |-- column_H: string (nullable = true)
 |-- id: string (nullable = false)
 |-- m_id: string (nullable = false)

+--------+--------+--------+--------+--------+--------+--------+--------------------+--------------------+
|column_A|column_B|column_C|column_D|column_E|column_F|column_G|                  id|                m_id|
+--------+--------+--------+--------+--------+--------+--------+--------------------+--------------------+
|     101|      16|      18|   RANDY|    ANDY|     101|      16|420d6da5-036a-401...|35d2e759-5b94-485...|
|     102|      27|      18|   RANDY|    ANDY|     101|      16|520d6da6-036a-401...|45d2e759-5b94-485...|
|     103|      25|      18|   RANDY|    ANDY|     101|      16|620d6da5-036a-401...|55d2e759-5b94-485...|
|     104|       7|       8|   MANDY|    ANDY|     110|     160|720d6da5-036a-401...|75d2e759-5b94-485...|
|     105|       9|      80|   MANDY|    ANDY|      11|      12|920d6da5-036a-401...|85d2e759-5b94-485...|
+--------+--------+--------+--------+--------+--------+--------+--------------------+--------------------+


// s_df Schema - Renamed the fields to help with the join

root
 |-- s_column_C: string (nullable = true)
 |-- s_column_D: string (nullable = true)
 |-- s_column_E: string (nullable = true)
 |-- s_column_F: string (nullable = true)
 |-- s_column_G: string (nullable = true)
 |-- _id: string (nullable = false)
 |-- s_id: string (nullable = false)

+----------+----------+----------+----------+----------+--------------------+--------------------+
|s_column_C|s_column_D|s_column_E|s_column_F|s_column_G|                 _id|                s_id|
+----------+----------+----------+----------+----------+--------------------+--------------------+
|        18|     RANDY|      ANDY|       101|        16|420d6da5-036a-401...|9ee2e759-5b94-485...|
|         8|     MANDY|      ANDY|       110|       160|720d6da5-036a-401...|3ed2e759-5b94-485...|
|        80|     MANDY|      ANDY|        11|        12|920d6da5-036a-401...|24d2e759-5b94-485...|
+----------+----------+----------+----------+----------+--------------------+--------------------+



I want to join the m_df and s_df dataframes (PySpark) in such a way that:

  1. Each m_id has exactly one s_id.
  2. Each s_id can have multiple m_ids mapped to it.

Expected Result :

+--------------------+--------------------+--------+----------+--------+----------+--------+----------+--------+----------+--------+----------+
|                m_id|                s_id|column_C|s_column_C|column_D|s_column_D|column_E|s_column_E|column_F|s_column_F|column_G|s_column_G|
+--------------------+--------------------+--------+----------+--------+----------+--------+----------+--------+----------+--------+----------+
|35d2e759-5b94-485...|9ee2e759-5b94-485...|      18|        18|   RANDY|     RANDY|    ANDY|      ANDY|     101|       101|      16|        16|
|45d2e759-5b94-485...|9ee2e759-5b94-485...|      18|        18|   RANDY|     RANDY|    ANDY|      ANDY|     101|       101|      16|        16|
|55d2e759-5b94-485...|9ee2e759-5b94-485...|      18|        18|   RANDY|     RANDY|    ANDY|      ANDY|     101|       101|      16|        16|
|75d2e759-5b94-485...|3ed2e759-5b94-485...|       8|         8|   MANDY|     MANDY|    ANDY|      ANDY|     110|       110|     160|       160|
|85d2e759-5b94-485...|24d2e759-5b94-485...|      80|        80|   MANDY|     MANDY|    ANDY|      ANDY|      11|        11|      12|        12|
+--------------------+--------------------+--------+----------+--------+----------+--------+----------+--------+----------+--------+----------+

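Right now, I can get this result by comparing each m_df.column_C value against the s_df.s_column_C values and mapping the matching s_id onto m_df, i.e. by using a UDF (PySpark).

But I know that custom UDFs are generally inefficient, so I have been looking for a better way to perform this join.

How can I solve this efficiently?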

m_s_df = m_df.join(s_df, on=(m_df.column_C == s_df.s_column_C)
                             & (m_df.column_D == s_df.s_column_D)
                             & (m_df.column_E == s_df.s_column_E)
                             & (m_df.column_F == s_df.s_column_F)
                             & (m_df.column_G == s_df.s_column_G),
                         how='left')

I got the mapping between the m_df and s_df dataframes using the left join shown above.

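A 'LEFT' join on all the common columns between the 2 dataframes satisfies the following conditions:

  1. Each m_id has exactly one s_id.
  2. Each s_id can have multiple m_ids mapped to it.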
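
As a quick sanity check of these two conditions, here is a plain-Python sketch that works on (m_id, s_id) pairs, for example a small sample collected from the joined dataframe. The helper name check_mapping is hypothetical, and the pairs below are copied from the expected result (ids truncated exactly as shown there). For large data the same counts would be better computed in Spark rather than on the driver.

```python
from collections import defaultdict


def check_mapping(pairs):
    """Check (m_id, s_id) pairs; s_id may be None for unmatched m_df rows.

    Returns (bad_m_ids, m_count_per_s_id):
      bad_m_ids        -- m_ids with no s_id or with more than one s_id
      m_count_per_s_id -- how many distinct m_ids map to each s_id
    """
    s_per_m = defaultdict(set)
    m_per_s = defaultdict(set)
    for m_id, s_id in pairs:
        s_per_m[m_id].add(s_id)
        if s_id is not None:
            m_per_s[s_id].add(m_id)
    # Condition 1 fails for an m_id that is unmatched or matched more than once.
    bad_m_ids = sorted(m for m, s in s_per_m.items() if len(s) != 1 or None in s)
    m_count_per_s_id = {s: len(ms) for s, ms in m_per_s.items()}
    return bad_m_ids, m_count_per_s_id


# (m_id, s_id) pairs taken from the expected result table.
pairs = [
    ("35d2e759-5b94-485...", "9ee2e759-5b94-485..."),
    ("45d2e759-5b94-485...", "9ee2e759-5b94-485..."),
    ("55d2e759-5b94-485...", "9ee2e759-5b94-485..."),
    ("75d2e759-5b94-485...", "3ed2e759-5b94-485..."),
    ("85d2e759-5b94-485...", "24d2e759-5b94-485..."),
]
bad_m_ids, m_count_per_s_id = check_mapping(pairs)
```

An empty bad_m_ids means condition 1 holds for the sample; m_count_per_s_id shows how many m_ids share each s_id.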

Please let me know if there is a better way.