PySpark, how to append a DataFrame but remove duplicates from a specific one
I retrieve data from a source once a day, but because of some latency I need to start each retrieval a little earlier than the latest data from the previous run. This causes some overlap, and what I want to achieve is to remove the rows in the old DataFrame whose timestamps appear in the new DataFrame, so that I only keep the most recently retrieved information.
Sample data:
df_old.show()
+-------------------+------------------+------------------+
| index| A| B|
+-------------------+------------------+------------------+
|2013-01-01 00:00:00| 6.251379599223777| 10.23320055553287|
|2013-01-01 00:10:00| 6.245690342672945| 10.22296550603164|
|2013-01-01 00:20:00|6.2534029157968956|10.221452136599193|
|2013-01-01 00:30:00| 6.247532408988978|10.212423634472028|
|2013-01-01 00:40:00| 6.253508510989639|10.194494950388954|
|2013-01-01 00:50:00| 6.247517363773414|10.200814690766375|
|2013-01-01 01:00:00| 6.25381864046542|10.192425005184585|
|2013-01-01 01:10:00| 6.250060498528904|10.181246688945123|
|2013-01-01 01:20:00| 6.254461614739839| 10.18021442155982|
|2013-01-01 01:30:00| 6.233226501275796|10.180681886095698|
|2013-01-01 01:40:00| 6.252799353320566|10.169008765187861|
|2013-01-01 01:50:00| 6.248423707837854| 10.16567354928804|
|2013-01-01 02:00:00| 6.253744374163072|10.161773904107136|
|2013-01-01 02:10:00| 6.238242597088755|10.151641862402213|
+-------------------+------------------+------------------+
df_new.show()
+-------------------+------------------+------------------+
| index| A| B|
+-------------------+------------------+------------------+
|2013-01-01 01:30:00| 7 | 20 |
|2013-01-01 01:40:00| 7 | 20 |
|2013-01-01 01:50:00| 7 | 20 |
|2013-01-01 02:00:00| 7 | 20 |
|2013-01-01 02:10:00| 7 | 20 |
|2013-01-01 02:20:00| 6.24546611958182| 10.14886792741417|
|2013-01-01 02:30:00| 6.240802043802097| 10.15267232231782|
|2013-01-01 02:40:00| 6.249921473522189|10.139161473568803|
|2013-01-01 02:50:00|6.2219054718011515| 10.11521891469772|
|2013-01-01 03:00:00| 6.247084671443932|10.088592826542145|
|2013-01-01 03:10:00| 6.24950717588649|10.065343892142995|
+-------------------+------------------+------------------+
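For anyone reproducing this, a minimal sketch that builds two such frames (just a few of the rows above; index is kept as a string here, which .show() renders the same way as a timestamp):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A couple of rows from each frame above; index left as a string for brevity.
df_old = spark.createDataFrame(
    [('2013-01-01 01:30:00', 6.233226501275796, 10.180681886095698),
     ('2013-01-01 01:40:00', 6.252799353320566, 10.169008765187861)],
    ['index', 'A', 'B'])
df_new = spark.createDataFrame(
    [('2013-01-01 01:30:00', 7.0, 20.0),
     ('2013-01-01 02:20:00', 6.24546611958182, 10.14886792741417)],
    ['index', 'A', 'B'])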
What I would like to achieve is this result, where for the overlapping timestamps only the new df's rows are kept:
df_combined.show()
+-------------------+------------------+------------------+
| index| A| B|
+-------------------+------------------+------------------+
|2013-01-01 00:00:00| 6.251379599223777| 10.23320055553287|
|2013-01-01 00:10:00| 6.245690342672945| 10.22296550603164|
|2013-01-01 00:20:00|6.2534029157968956|10.221452136599193|
|2013-01-01 00:30:00| 6.247532408988978|10.212423634472028|
|2013-01-01 00:40:00| 6.253508510989639|10.194494950388954|
|2013-01-01 00:50:00| 6.247517363773414|10.200814690766375|
|2013-01-01 01:00:00| 6.25381864046542|10.192425005184585|
|2013-01-01 01:10:00| 6.250060498528904|10.181246688945123|
|2013-01-01 01:20:00| 6.254461614739839| 10.18021442155982|
|2013-01-01 01:30:00| 7 | 20 |
|2013-01-01 01:40:00| 7 | 20 |
|2013-01-01 01:50:00| 7 | 20 |
|2013-01-01 02:00:00| 7 | 20 |
|2013-01-01 02:10:00| 7 | 20 |
|2013-01-01 02:20:00| 6.24546611958182| 10.14886792741417|
|2013-01-01 02:30:00| 6.240802043802097| 10.15267232231782|
|2013-01-01 02:40:00| 6.249921473522189|10.139161473568803|
|2013-01-01 02:50:00|6.2219054718011515| 10.11521891469772|
|2013-01-01 03:00:00| 6.247084671443932|10.088592826542145|
|2013-01-01 03:10:00| 6.24950717588649|10.065343892142995|
+-------------------+------------------+------------------+
Is there any simple built-in function that achieves this result?
Use an outer join. Joining on index keeps every timestamp from both frames, and coalesce then picks the new frame's value whenever it exists, falling back to the old one otherwise.
from pyspark.sql.functions import coalesce

# outer join keeps every timestamp; coalesce prefers the new frame's values
df_old.join(df_new, ['index'], 'outer') \
    .select('index', coalesce(df_new.A, df_old.A), coalesce(df_new.B, df_old.B)).toDF('index', 'A', 'B') \
    .orderBy('index').show(20, False)
+-------------------+------------------+------------------+
|index |A |B |
+-------------------+------------------+------------------+
|2013-01-01 00:00:00|6.251379599223777 |10.23320055553287 |
|2013-01-01 00:10:00|6.245690342672945 |10.22296550603164 |
|2013-01-01 00:20:00|6.2534029157968956|10.221452136599193|
|2013-01-01 00:30:00|6.247532408988978 |10.212423634472028|
|2013-01-01 00:40:00|6.253508510989639 |10.194494950388954|
|2013-01-01 00:50:00|6.247517363773414 |10.200814690766375|
|2013-01-01 01:00:00|6.25381864046542 |10.192425005184585|
|2013-01-01 01:10:00|6.250060498528904 |10.181246688945123|
|2013-01-01 01:20:00|6.254461614739839 |10.18021442155982 |
|2013-01-01 01:30:00|7.0 |20.0 |
|2013-01-01 01:40:00|7.0 |20.0 |
|2013-01-01 01:50:00|7.0 |20.0 |
|2013-01-01 02:00:00|7.0 |20.0 |
|2013-01-01 02:10:00|7.0 |20.0 |
|2013-01-01 02:20:00|6.24546611958182 |10.14886792741417 |
|2013-01-01 02:30:00|6.240802043802097 |10.15267232231782 |
|2013-01-01 02:40:00|6.249921473522189 |10.139161473568803|
|2013-01-01 02:50:00|6.2219054718011515|10.11521891469772 |
|2013-01-01 03:00:00|6.247084671443932 |10.088592826542145|
|2013-01-01 03:10:00|6.24950717588649 |10.065343892142995|
+-------------------+------------------+------------------+
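An alternative that matches the question's wording more literally (drop the old rows whose timestamp appears in the new frame, then append the new frame) is a left anti-join followed by a union; a minimal sketch, assuming both frames share the schema shown above:
# Keep only the old rows whose timestamp does NOT occur in df_new,
# then append df_new in full (unionByName needs Spark 2.3+).
df_combined = (df_old.join(df_new, on='index', how='left_anti')
               .unionByName(df_new)
               .orderBy('index'))
df_combined.show(20, False)

Unlike the outer-join version, this keeps each row intact instead of rebuilding it column by column, so it also works unchanged when the frames have many columns.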