Comparing data between two Spark dataframes and populating PASS on a match and FAIL otherwise in the corresponding columns

I have two Spark dataframes, as shown below.

Source1 data

Source2 data

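I am using PySpark (Python) to compare the data between the two sources, with Snapshot_Date as the key column, and I want to show the result in another dataframe, as shown below.

Compare

The color coding is only there to make the example easier to understand; it is not required.

Thanks in advance.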

You can use the spark-extension package mentioned in this answer.

You can then modify the result dataframe as needed to obtain the final dataframe.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from gresearch.spark.diff import *

sc = SparkContext('local')
sqlContext = SQLContext(sc)

data1 = [
        ("Source1",20201116, 436039, 123, 222, 333,0, 555),
        ("Source1",20201117,436034, 234, 34, 7, 5, 678)
      ]

df1Columns = ["Source","Snapshot_Date","REC_COUNT","Col1","Col2","Col3","Col4","Col5"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)

print("Source1 dataframe")
df1.show(truncate=False)

data2 = [
        ("Source2", 20201116,436039,234,234,333,0,555),
        ("Source2", 20201117,436034,234,5,7,5,678)
      ]

df2Columns = ["Source", "Snapshot_Date","REC_COUNT","Col1","Col2","Col3","Col4","Col5"]
df2 = sqlContext.createDataFrame(data=data2, schema = df2Columns)

print(" Source2 dataframe")
df2.show(truncate=False)

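# Ask the diff to add a "changes" column listing the columns that differ
options = DiffOptions().with_change_column("changes")
# Compare the two dataframes row by row, matching rows on the key column
result = df1.diff_with_options(df2, options, 'Snapshot_Date')

result.select("Snapshot_Date", "diff","changes").show(truncate=False)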

The output is shown below. The changes column lists the columns whose values differ between the two sources.

Source1 dataframe
+-------+-------------+---------+----+----+----+----+----+
|Source |Snapshot_Date|REC_COUNT|Col1|Col2|Col3|Col4|Col5|
+-------+-------------+---------+----+----+----+----+----+
|Source1|20201116     |436039   |123 |222 |333 |0   |555 |
|Source1|20201117     |436034   |234 |34  |7   |5   |678 |
+-------+-------------+---------+----+----+----+----+----+

 Source2 dataframe
+-------+-------------+---------+----+----+----+----+----+
|Source |Snapshot_Date|REC_COUNT|Col1|Col2|Col3|Col4|Col5|
+-------+-------------+---------+----+----+----+----+----+
|Source2|20201116     |436039   |234 |234 |333 |0   |555 |
|Source2|20201117     |436034   |234 |5   |7   |5   |678 |
+-------+-------------+---------+----+----+----+----+----+

+-------------+----+--------------------+
|Snapshot_Date|diff|changes             |
+-------------+----+--------------------+
|20201116     |C   |[Source, Col1, Col2]|
|20201117     |C   |[Source, Col2]      |
+-------------+----+--------------------+