Comparing data between two Spark dataframes and populating PASS on a match and FAIL on a mismatch in the corresponding columns
I have two Spark dataframes, as shown below.
Source1 data
Source2 data
I am using PySpark to compare the data between the two sources, with Snapshot_Date as the key column, and I want to show the result in another dataframe, as shown below.
Compare
The color coding is only there to make the example easier to read; it is not required.
Thanks in advance.
You can use the spark-extension package mentioned in this answer.
You can then modify the result dataframe as needed to get the final dataframe.
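from pyspark import SparkContext
from pyspark.sql import SQLContext
# Importing the diff module adds the diff methods to DataFrame
from gresearch.spark.diff import *
sc = SparkContext('local')
sqlContext = SQLContext(sc)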
data1 = [
("Source1",20201116, 436039, 123, 222, 333,0, 555),
("Source1",20201117,436034, 234, 34, 7, 5, 678)
]
df1Columns = ["Source","Snapshot_Date","REC_COUNT","Col1","Col2","Col3","Col4","Col5"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
print("Source1 dataframe")
df1.show(truncate=False)
data2 = [
("Source2", 20201116,436039,234,234,333,0,555),
("Source2", 20201117,436034,234,5,7,5,678)
]
df2Columns = ["Source", "Snapshot_Date","REC_COUNT","Col1","Col2","Col3","Col4","Col5"]
df2 = sqlContext.createDataFrame(data=data2, schema = df2Columns)
print("Source2 dataframe")
df2.show(truncate=False)
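# Ask for an extra "changes" column that lists the names of the differing columns
options = DiffOptions().with_change_column("changes")
# Compare the two dataframes, matching rows on the Snapshot_Date key column
result = df1.diff_with_options(df2, options, 'Snapshot_Date')
result.select("Snapshot_Date", "diff","changes").show(truncate=False)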
The output is shown below. The changes column lists the columns whose values differ.
Source1 dataframe
+-------+-------------+---------+----+----+----+----+----+
|Source |Snapshot_Date|REC_COUNT|Col1|Col2|Col3|Col4|Col5|
+-------+-------------+---------+----+----+----+----+----+
|Source1|20201116 |436039 |123 |222 |333 |0 |555 |
|Source1|20201117 |436034 |234 |34 |7 |5 |678 |
+-------+-------------+---------+----+----+----+----+----+
Source2 dataframe
+-------+-------------+---------+----+----+----+----+----+
|Source |Snapshot_Date|REC_COUNT|Col1|Col2|Col3|Col4|Col5|
+-------+-------------+---------+----+----+----+----+----+
|Source2|20201116 |436039 |234 |234 |333 |0 |555 |
|Source2|20201117 |436034 |234 |5 |7 |5 |678 |
+-------+-------------+---------+----+----+----+----+----+
+-------------+----+--------------------+
|Snapshot_Date|diff|changes |
+-------------+----+--------------------+
|20201116 |C |[Source, Col1, Col2]|
|20201117 |C |[Source, Col2] |
+-------------+----+--------------------+
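To get from the changes column to the PASS/FAIL layout the question asks for, the rule is simple: a column is FAIL if its name appears in changes for that Snapshot_Date, and PASS otherwise. The following is only a minimal sketch of that rule in plain Python, applied to the two rows of the diff output above. The helper name to_pass_fail, the VALUE_COLUMNS list and the decision to leave out the Source column are assumptions made for illustration, not part of spark-extension.

```python
# Columns to report on. "Source" is left out on purpose (an assumption):
# it always differs between the two inputs, so it would always be FAIL.
VALUE_COLUMNS = ["REC_COUNT", "Col1", "Col2", "Col3", "Col4", "Col5"]


def to_pass_fail(snapshot_date, changes, columns=VALUE_COLUMNS):
    """Hypothetical helper: build one output row with PASS/FAIL per column."""
    # A missing list is treated as "no changes" here, which is an assumption;
    # rows that exist on only one side would need separate handling.
    changed = set(changes or [])
    row = {"Snapshot_Date": snapshot_date}
    for name in columns:
        row[name] = "FAIL" if name in changed else "PASS"
    return row


# (Snapshot_Date, changes) pairs copied from the diff output above.
diff_rows = [
    (20201116, ["Source", "Col1", "Col2"]),
    (20201117, ["Source", "Col2"]),
]

compare = [to_pass_fail(date, changes) for date, changes in diff_rows]
for row in compare:
    print(row)
```

On a real dataset the same rule would more likely be written as a Spark column expression on the result dataframe (for example with when and array_contains from pyspark.sql.functions) rather than on collected rows, so that the comparison stays distributed.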