Display the difference between the values of two columns from two different DataFrames without losing the other columns

I have two DataFrames that have different values in column "d" but the same values in columns "a" and "b".

Here is df1:

df1 = spark.createDataFrame([
    ('c', 'd', 8),
    ('e', 'f', 8),
    ('c', 'j', 9),
], ['a', 'b', 'd'])
df1.show()
+---+---+---+
|  a|  b|  d|
+---+---+---+
|  c|  d|  8|
|  e|  f|  8|
|  c|  j|  9|
+---+---+---+

Here is df2:

df2 = spark.createDataFrame([
    ('c', 'd', 7),
    ('e', 'f', 3),
    ('c', 'j', 8),
], ['a', 'b', 'd'])
df2.show()
+---+---+---+
|  a|  b|  d|
+---+---+---+
|  c|  d|  7|
|  e|  f|  3|
|  c|  j|  8|
+---+---+---+

I want to get the difference between the values in column "d", but I also want to keep columns "a" and "b":

df3 
+---+---+---+
|  a|  b|  d|
+---+---+---+
|  c|  d|  1|
|  e|  f|  5|
|  c|  j|  1|
+---+---+---+

I tried subtracting one DataFrame from the other, but it did not work:

df1.subtract(df2).show()
+---+---+---+
|  a|  b|  d|
+---+---+---+
|  c|  d|  8|
|  e|  f|  8|
|  c|  j|  9|
+---+---+---+
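
A note on why this returns df1 unchanged: `DataFrame.subtract` is a set difference over whole rows (the equivalent of `EXCEPT DISTINCT` in SQL), not an arithmetic subtraction of column values. The plain-Python sketch below models the rows as tuples to illustrate the idea; the names `df1_rows`, `df2_rows`, and `remaining` are made up for this illustration and are not part of the Spark API.

```python
# Model each DataFrame as a set of whole rows (a, b, d).
df1_rows = {("c", "d", 8), ("e", "f", 8), ("c", "j", 9)}
df2_rows = {("c", "d", 7), ("e", "f", 3), ("c", "j", 8)}

# A set difference drops a row from df1 only if an identical row
# (same a, same b AND same d) exists in df2.
remaining = df1_rows - df2_rows

# No row matches on all three columns, so nothing is removed.
print(remaining == df1_rows)
```

Because every row differs in "d", no row of df1 has an exact match in df2, and all three rows survive.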

Here is how you can do it:

df3 = df1.join(df2, on=['a', 'b'], how='outer').select('a', 'b', (df1.d - df2.d).alias('diff'))

df3.show()