Iterate two dataframes, compare and change a value in pandas or pyspark

Question

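I am trying to work through an exercise in pandas.

I have two dataframes. I need to compare several columns between them and, when the comparison succeeds, change the value of one column in the first dataframe.

Dataframe 1: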

Article    Country   Colour    Buy
Pants      Germany   Red       0
Pull       Poland    Blue      0

Initially, every article has the 'Buy' flag set to zero. Dataframe 2 looks like this:

Article    Origin    Colour   
Pull       Poland    Blue    
Dress      Italy     Red

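I want to check whether the Article, Country/Origin and Colour columns match (that is, whether each article in dataframe 1 can be found in dataframe 2) and, if so, set the 'Buy' flag to 1.

I tried iterating over both dataframes with pyspark, but pyspark dataframes are not iterable. I considered doing it in pandas, but apparently changing values while iterating is bad practice.

What code in pyspark or pandas would do what I need?

Thanks!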

Answer 1

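Use merge with the indicator argument, then map the values. Make sure to drop_duplicates on the merge keys in the right frame so that the merged result always has the same length as the original, and rename the columns so that the same information is not duplicated after the merge. There is no need for the pre-defined 0 column.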

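# The flag is rebuilt from the merge indicator, so drop the placeholder column
df1 = df1.drop(columns='Buy')
# Left merge keeps every row of df1; the indicator column is named 'Buy'
df1 = df1.merge(df2.drop_duplicates().rename(columns={'Origin': 'Country'}),
                indicator='Buy', how='left')
# 'both' means the row was found in df2, 'left_only' means it was not
df1['Buy'] = df1['Buy'].map({'left_only': 0, 'both': 1}).astype(int)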

  Article  Country Colour  Buy
0   Pants  Germany    Red    0
1    Pull   Poland   Blue    1
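
For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the two sample frames from the question. It is a variant of the merge approach rather than the exact code above: the names `keys`, `right`, `merged` and `result` are made up for illustration, the key columns are passed explicitly through `on=`, and the flag is derived by comparing the default `_merge` indicator column with `'both'` instead of mapping it.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "Article": ["Pants", "Pull"],
    "Country": ["Germany", "Poland"],
    "Colour": ["Red", "Blue"],
    "Buy": [0, 0],
})
df2 = pd.DataFrame({
    "Article": ["Pull", "Dress"],
    "Origin": ["Poland", "Italy"],
    "Colour": ["Blue", "Red"],
})

# Columns that must match between the two frames (illustrative name)
keys = ["Article", "Country", "Colour"]

# Align the column name and keep one row per key combination,
# so the left merge cannot add rows to df1
right = df2.rename(columns={"Origin": "Country"})[keys].drop_duplicates()

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
merged = df1.drop(columns="Buy").merge(right, on=keys, how="left", indicator=True)

# Turn the indicator into a 0/1 flag and drop the helper column
result = merged.drop(columns="_merge")
result["Buy"] = (merged["_merge"] == "both").astype(int)

print(result)
```

Selecting only the key columns from the right frame before `drop_duplicates` matters if dataframe 2 ever carries extra columns, since duplicates are then judged on the merge keys alone.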

