How to join two DataFrames with multiple overlaps in PySpark

Question

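Hi, I have a dataset containing multiple households, where all the people within each household have been matched between two data sources. The dataframe therefore consists of a 'household' col and two person cols (one for each data source). However, some people (e.g. Jonathan or Peter below) could not be matched, so their second person column is blank.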

Household	Person_source_A	Person_source_B
1	Oliver	Oliver
1	Jonathan
1	Amy	Amy
2	David	Dave
2	Mary	Mary
3	Lizzie	Elizabeth
3	Peter

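As the dataframe is large, my aim is to take a sample of the unmatched individuals and then output a df containing everyone in the households where a sampled unmatched person exists. For example, if my random sample includes Jonathan but not Peter, then only household 1 would appear in the output.

My problem is that I have filtered to get the sample but cannot make progress from there. Some combination of join, agg/groupBy... should work, but I'm struggling. I have added a flag to the sampled unmatched names to identify them, which I think will help...

My code: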

from pyspark.sql.functions import col, lit

# filter to unmatched people (present in source A, missing in source B)
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())

# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1)

# add flag of sampled unmatched persons
df_unmatched_sample = df_unmatched_sample.withColumn('sample_flag', lit('1'))

Answer 1

As it relates to your intention:

I just want to reduce my dataframe to only show the full households of households where an unmatched person exists that has been selected by a random sample out of all unmatched people

Using your existing approach, you can join on the Household of the sampled records:

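from pyspark.sql.functions import col

# filter to unmatched people (present in source A, missing in source B)
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())

# take random sample of 10% and keep one row per sampled household
df_unmatched_sample = df_unmatched.sample(0.1).select("Household").distinct()

# keep every row of df whose Household appears in the sample
desired_df = df.join(df_unmatched_sample, ["Household"], "inner")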

Edit 1

In reply to the OP's comment:

Is there a slightly different way that keeps a flag to identify the sampled unmatched person (as there are some households with more than one unmatched person)?

Adding the flag column to the sample and then left-joining the sample onto the existing dataset may help you achieve this, e.g.:

from pyspark.sql.functions import col, lit

# filter to unmatched people (present in source A, missing in source B)
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())

# take random sample of 10% and flag the sampled persons
df_unmatched_sample = df_unmatched.sample(0.1).withColumn('sample_flag', lit('1'))

# left join: every row of df is kept; sample_flag is null unless the row was sampled
desired_df = (
    df.alias("dfo").join(
        df_unmatched_sample.alias("dfu"),
        [
            col("dfo.Household") == col("dfu.Household"),
            col("dfo.per_A") == col("dfu.per_A"),
            col("dfo.per_B").isNull()
        ],
        "left"
    )
)
