How to join two DataFrames with multiple overlaps in PySpark

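Hi, I have a dataset of multiple households in which every person in a household has been matched between two data sources. The dataframe therefore consists of a 'household' col and two person cols (one for each data source). However, some people (such as Jonathan or Peter below) could not be matched, so the second person column is blank for them.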

Household  Person_source_A  Person_source_B
1          Oliver           Oliver
1          Jonathan
1          Amy              Amy
2          David            Dave
2          Mary             Mary
3          Lizzie           Elizabeth
3          Peter

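Because the dataframe is large, my aim is to take a sample of the unmatched individuals and then output a df containing everyone in the households where a sampled unmatched person exists. For example, if my random sample includes Jonathan but not Peter, then only household 1 would appear in the output.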

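My problem is that I have filtered to get the sample but cannot make progress from there. Some combination of join, agg/groupBy... should work, but I'm struggling. I added a flag to the sampled unmatched names to identify them, which I thought would help...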


from pyspark.sql.functions import col, lit

# filter to unmatched people (present in source A, missing in source B)
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())

# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1)

# add flag of sampled unmatched persons
df_unmatched_sample = df_unmatched_sample.withColumn('sample_flag', lit('1'))


I just want to reduce my dataframe to the full households in which at least one unmatched person was selected by the random sample of all unmatched people.

With your existing approach, you can join on the Household of the sampled records:

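from pyspark.sql.functions import col

# filter to unmatched people (present in source A, missing in source B)
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())

# take random sample of 10% and keep only the distinct households it touches
df_unmatched_sample = df_unmatched.sample(0.1).select("Household").distinct()

# inner join keeps every row of the households that contain a sampled person
desired_df = df.join(df_unmatched_sample, ["Household"], "inner")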
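
To make the logic of that join concrete, here is a minimal plain-Python sketch (not Spark code) that mirrors the same four steps on the table from the question: filter, sample, distinct households, inner join. The function name households_for_sample and the fixed set of sampled names are hypothetical stand-ins for illustration; in Spark the sample would come from .sample(0.1) and would be random.

```python
# Rows from the question's table: (Household, Person_source_A, Person_source_B).
# None marks a person with no match in source B.
rows = [
    (1, "Oliver", "Oliver"),
    (1, "Jonathan", None),
    (1, "Amy", "Amy"),
    (2, "David", "Dave"),
    (2, "Mary", "Mary"),
    (3, "Lizzie", "Elizabeth"),
    (3, "Peter", None),
]


def households_for_sample(rows, sampled_names):
    """Return every row whose household contains a sampled unmatched person."""
    # Step 1: filter to unmatched people (A present, B missing).
    unmatched = [r for r in rows if r[1] is not None and r[2] is None]
    # Step 2: keep only the sampled ones (assumed stand-in for a random sample).
    sampled = [r for r in unmatched if r[1] in sampled_names]
    # Step 3: distinct household keys of the sampled rows.
    households = {r[0] for r in sampled}
    # Step 4: inner join back to the full data on Household.
    return [r for r in rows if r[0] in households]


# Hypothetical sample: Jonathan was drawn, Peter was not.
result = households_for_sample(rows, {"Jonathan"})
for row in result:
    print(row)
```

With Jonathan sampled and Peter not, only the three rows of household 1 survive, which is the behaviour the question asks for. Taking the distinct households before joining matters: without it, a household with two sampled unmatched people would have its rows duplicated by the join.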

Edit 1

In response to the OP's comment:

Is there a slightly different way that keeps a flag to identify the sampled unmatched person (as there are some households with more than one unmatched person)?


from pyspark.sql.functions import col, lit

# filter to unmatched people (present in source A, missing in source B)
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())

# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1).withColumn('sample_flag', lit('1'))

desired_df = (
            col("dfo.Household")==col("dfu.Household") , 