PySpark Dataframes: Full Outer Join with a condition
I have the following two dataframes -
dataframe_a
+----------------+---------------+
| user_id| domain|
+----------------+---------------+
| josh| wanadoo.fr|
| samantha| randomn.fr|
| bob| eidsiva.net|
| dylan| vodafone.it|
+----------------+---------------+
dataframe_b
+----------------+---------------+
| user_id| domain|
+----------------+---------------+
| josh| oldwebsite.fr|
| samantha| randomn.fr|
| dylan| oldweb.it|
| ryan| chicks.it|
+----------------+---------------+
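(For reference, a minimal sketch of how these example dataframes could be constructed, assuming an active SparkSession bound to a variable named spark; the data is copied from the tables above:
dataframe_a = spark.createDataFrame(
    [("josh", "wanadoo.fr"),
     ("samantha", "randomn.fr"),
     ("bob", "eidsiva.net"),
     ("dylan", "vodafone.it")],
    ["user_id", "domain"],
)
dataframe_b = spark.createDataFrame(
    [("josh", "oldwebsite.fr"),
     ("samantha", "randomn.fr"),
     ("dylan", "oldweb.it"),
     ("ryan", "chicks.it")],
    ["user_id", "domain"],
)
)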
I want to do a full outer join, but retain the value from dataframe_a's domain column in the case where I get two different domains for a single user_id. So, my desired dataframe would look like -
desired_df
+----------------+---------------+
| user_id| domain|
+----------------+---------------+
| josh| wanadoo.fr|
| samantha| randomn.fr|
| bob| eidsiva.net|
| dylan| vodafone.it|
| ryan| chicks.it|
+----------------+---------------+
I think I can do something like -
desired_df = dataframe_a.join(dataframe_b, ["user_id"], how="full_outer").drop(dataframe_b.domain)
But I'm worried about whether this will give me ryan in my desired dataframe. Is this the right way to go about it?
No. Doing a full_outer join will leave the desired dataframe with a null value for the domain corresponding to ryan. No type of join operation on the given dataframes will produce your desired output.
You'll want to use 'coalesce'. In your current solution, ryan will be in the resulting dataframe, but with a null value in the remaining dataframe_a.domain column.
joined_df = dataframe_a.join(dataframe_b, ["user_id"], how="full_outer")
+----------------+---------------+---------------+
| user_id| domain| domain|
+----------------+---------------+---------------+
| josh| wanadoo.fr| oldwebsite.fr|
| samantha| randomn.fr| randomn.fr|
| bob| eidsiva.net| null|
| dylan| vodafone.it| oldweb.it|
| ryan| null| chicks.it|
+----------------+---------------+---------------+
'coalesce' allows you to specify an order of preference, but skips null values.
import pyspark.sql.functions as F
joined_df = joined_df.withColumn(
    "preferred_domain",
    F.coalesce(dataframe_a.domain, dataframe_b.domain)
)
joined_df = joined_df.drop(dataframe_a.domain).drop(dataframe_b.domain)
Giving
+----------------+----------------+
| user_id|preferred_domain|
+----------------+----------------+
| josh| wanadoo.fr|
| samantha| randomn.fr|
| bob| eidsiva.net|
| dylan| vodafone.it|
| ryan| chicks.it|
+----------------+----------------+
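The same result can also be written as one chained expression. A minimal sketch of that variant, assuming the dataframes above; it uses dataframe aliases to disambiguate the two domain columns and renames the coalesced column back to domain:
import pyspark.sql.functions as F

# full outer join, then prefer a.domain over b.domain for each user_id
desired_df = (
    dataframe_a.alias("a")
    .join(dataframe_b.alias("b"), on="user_id", how="full_outer")
    .select(
        "user_id",
        F.coalesce(F.col("a.domain"), F.col("b.domain")).alias("domain"),
    )
)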