GroupBy operation in Pyspark

I have a dataframe in which I have computed the haversine distance from latitude and longitude. For each id I want to find the minimum distance and the corresponding store_no.

The dataframe looks like this:

+---+---------+---------+-----+-----+--------+---------+
| id| user_lat| user_lon|s_lat|s_lon|store_no|  dist_km|
+---+---------+---------+-----+-----+--------+---------+
|  1|13.031885|80.235574|29.91|73.88|      22| 1988.047|
|  1|13.031885|80.235574|28.57|77.33|      23| 1754.225|
|  1|13.031885|80.235574|26.86|80.95|      24|1539.8511|
|  2|19.099819|72.915288|29.91|73.88|      22|1206.3154|
|  3| 22.22698| 84.83607|29.91|73.88|      22|1387.3323|
|  2|19.099819|72.915288|28.57|77.33|      23|1144.7731|
|  2|19.099819|72.915288|26.86|80.95|      24|1191.7048|
|  3| 22.22698| 84.83607|28.57|77.33|      23|1032.1859|
|  3| 22.22698| 84.83607|26.86|80.95|      24| 648.0673|
+---+---------+---------+-----+-----+--------+---------+

I want my final df to be:

+---+---------+---------+-----+-----+--------+---------+
| id| user_lat| user_lon|s_lat|s_lon|store_no|  dist_km|
+---+---------+---------+-----+-----+--------+---------+
|  1|13.031885|80.235574|26.86|80.95|      24|1539.8511|
|  2|19.099819|72.915288|28.57|77.33|      23|1144.7731|
|  3| 22.22698| 84.83607|26.86|80.95|      24| 648.0673|
+---+---------+---------+-----+-----+--------+---------+

This should work for you:

First, create nearest_store_df with the minimum distance per id:

import pyspark.sql.functions as psf

nearest_store_df = df\
    .groupBy('id')\
    .agg(psf.min('dist_km').alias('min_dist_km'))

Now join nearest_store_df back to the original dataframe:

# Join on id as well as distance; matching only on dist_km could pair a row
# with another id's minimum if two distances happen to be equal.
df\
    .join(nearest_store_df,
          (df.id == nearest_store_df.id) & (df.dist_km == nearest_store_df.min_dist_km),
          'inner')\
    .drop(nearest_store_df.id)\
    .drop('min_dist_km')\
    .show()