How to efficiently join 2 dataframes by comparing their column values
I have 2 dataframes, m_df & s_df:
// m_df Schema
root
|-- column_A: string (nullable = true)
|-- column_B: string (nullable = true)
|-- column_C: string (nullable = true)
|-- column_D: string (nullable = true)
|-- column_E: string (nullable = true)
|-- column_F: string (nullable = true)
|-- column_G: string (nullable = true)
|-- column_H: string (nullable = true)
|-- id: string (nullable = false)
|-- m_id: string (nullable = false)
+--------+--------+--------+--------+--------+--------+--------+--------------------+--------------------+
|column_A|column_B|column_C|column_D|column_E|column_F|column_G| id| m_id|
+--------+--------+--------+--------+--------+--------+--------+--------------------+--------------------+
| 101| 16| 18| RANDY| ANDY| 101| 16|420d6da5-036a-401...|35d2e759-5b94-485...|
| 102| 27| 18| RANDY| ANDY| 101| 16|520d6da6-036a-401...|45d2e759-5b94-485...|
| 103| 25| 18| RANDY| ANDY| 101| 16|620d6da5-036a-401...|55d2e759-5b94-485...|
| 104| 7| 8| MANDY| ANDY| 110| 160|720d6da5-036a-401...|75d2e759-5b94-485...|
| 105| 9| 80| MANDY| ANDY| 11| 12|920d6da5-036a-401...|85d2e759-5b94-485...|
+--------+--------+--------+--------+--------+--------+--------+--------------------+--------------------+
// s_df Schema - Renamed the fields to help with the join
root
|-- s_column_C: string (nullable = true)
|-- s_column_D: string (nullable = true)
|-- s_column_E: string (nullable = true)
|-- s_column_F: string (nullable = true)
|-- s_column_G: string (nullable = true)
|-- _id: string (nullable = false)
|-- s_id: string (nullable = false)
+----------+----------+----------+----------+----------+--------------------+--------------------+
|s_column_C|s_column_D|s_column_E|s_column_F|s_column_G| _id| s_id|
+----------+----------+----------+----------+----------+--------------------+--------------------+
| 18| RANDY| ANDY| 101| 16|420d6da5-036a-401...|9ee2e759-5b94-485...|
| 8| MANDY| ANDY| 110| 160|720d6da5-036a-401...|3ed2e759-5b94-485...|
| 80| MANDY| ANDY| 11| 12|920d6da5-036a-401...|24d2e759-5b94-485...|
+----------+----------+----------+----------+----------+--------------------+--------------------+
I want to join the s_df and m_df dataframes (PySpark) such that:
- Each m_id has exactly one s_id.
- Each s_id can have multiple m_ids mapped to it.
Expected Result:
+--------------------+--------------------+--------+----------+--------+----------+--------+----------+--------+----------+--------+----------+
| m_id| s_id|column_C|s_column_C|column_D|s_column_D|column_E|s_column_E|column_F|s_column_F|column_G|s_column_G|
+--------------------+--------------------+--------+----------+--------+----------+--------+----------+--------+----------+--------+----------+
|35d2e759-5b94-485...|9ee2e759-5b94-485...| 18| 18| RANDY| RANDY| ANDY| ANDY| 101| 101| 16| 16|
|45d2e759-5b94-485...|9ee2e759-5b94-485...| 18| 18| RANDY| RANDY| ANDY| ANDY| 101| 101| 16| 16|
|55d2e759-5b94-485...|9ee2e759-5b94-485...| 18| 18| RANDY| RANDY| ANDY| ANDY| 101| 101| 16| 16|
|75d2e759-5b94-485...|3ed2e759-5b94-485...| 8| 8| MANDY| MANDY| ANDY| ANDY| 110| 110| 160| 160|
|85d2e759-5b94-485...|24d2e759-5b94-485...| 80| 80| MANDY| MANDY| ANDY| ANDY| 11| 11| 12| 12|
+--------------------+--------------------+--------+----------+--------+----------+--------+----------+--------+----------+--------+----------+
Right now, I can do this by checking each m_df.column_C value against each s_df.s_column_C value (and likewise for the other columns) and mapping the corresponding s_id onto m_df, i.e. by using a UDF (PySpark).
But I know that custom UDFs are generally inefficient, so I have been looking for a better way to perform this join.
How can I solve this efficiently?
m_s_df = m_df.join(s_df, on=(m_df.column_C == s_df.s_column_C)
                            & (m_df.column_D == s_df.s_column_D)
                            & (m_df.column_E == s_df.s_column_E)
                            & (m_df.column_F == s_df.s_column_F)
                            & (m_df.column_G == s_df.s_column_G),
                   how='left')
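Since this is a plain equi-join on columns, Spark can plan it as a hash or sort-merge join without any UDF. Because s_df is much smaller than m_df in this example, a broadcast hint can avoid the shuffle entirely, and a final select keeps only the columns shown in the expected result. This is a minimal sketch, assuming s_df is small enough to fit in executor memory; the column names come from the schemas above:

from pyspark.sql.functions import broadcast

# Broadcast the small dataframe so every m_df partition joins locally, without a shuffle.
m_s_df = m_df.join(
    broadcast(s_df),
    on=(m_df.column_C == s_df.s_column_C)
       & (m_df.column_D == s_df.s_column_D)
       & (m_df.column_E == s_df.s_column_E)
       & (m_df.column_F == s_df.s_column_F)
       & (m_df.column_G == s_df.s_column_G),
    how='left',
)

# Keep only the id columns and the compared columns, matching the expected result.
m_s_df.select(
    'm_id', 's_id',
    'column_C', 's_column_C',
    'column_D', 's_column_D',
    'column_E', 's_column_E',
    'column_F', 's_column_F',
    'column_G', 's_column_G',
).show()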
I obtained the mapping between the m_df and s_df dataframes using the left join shown above: a 'LEFT' join on all the common columns between the 2 dataframes, which satisfies the conditions:
- Each m_id has exactly one s_id.
- Each s_id can have multiple m_ids mapped to it.
Please let me know if there is a better way.
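One caveat with how='left': any m_df row with no match in s_df keeps a null s_id, so "each m_id has one s_id" only holds if every key combination in m_df also exists in s_df. If unmatched rows should be dropped instead, an inner join does that; and if the compared columns can themselves contain nulls, eqNullSafe lets null values match each other, which == never does. A sketch under those assumptions:

# Drop m_df rows that have no matching s_df row, and treat null column values as equal.
m_s_df = m_df.join(
    s_df,
    on=m_df.column_C.eqNullSafe(s_df.s_column_C)
       & m_df.column_D.eqNullSafe(s_df.s_column_D)
       & m_df.column_E.eqNullSafe(s_df.s_column_E)
       & m_df.column_F.eqNullSafe(s_df.s_column_F)
       & m_df.column_G.eqNullSafe(s_df.s_column_G),
    how='inner',
)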