GroupBy operation in PySpark
I have a dataframe in which I computed the haversine distance from each user's latitude and longitude to each store. I want to find the minimum distance per id, along with the matching store_no.
The dataframe looks like this:
+---+---------+---------+-----+-----+--------+---------+
| id| user_lat| user_lon|s_lat|s_lon|store_no| dist_km|
+---+---------+---------+-----+-----+--------+---------+
| 1|13.031885|80.235574|29.91|73.88| 22| 1988.047|
| 1|13.031885|80.235574|28.57|77.33| 23| 1754.225|
| 1|13.031885|80.235574|26.86|80.95| 24|1539.8511|
| 2|19.099819|72.915288|29.91|73.88| 22|1206.3154|
| 3| 22.22698| 84.83607|29.91|73.88| 22|1387.3323|
| 2|19.099819|72.915288|28.57|77.33| 23|1144.7731|
| 2|19.099819|72.915288|26.86|80.95| 24|1191.7048|
| 3| 22.22698| 84.83607|28.57|77.33| 23|1032.1859|
| 3| 22.22698| 84.83607|26.86|80.95| 24| 648.0673|
+---+---------+---------+-----+-----+--------+---------+
I want my final df to be:
+---+---------+---------+-----+-----+--------+---------+
| id| user_lat| user_lon|s_lat|s_lon|store_no| dist_km|
+---+---------+---------+-----+-----+--------+---------+
| 1|13.031885|80.235574|26.86|80.95| 24|1539.8511|
| 2|19.099819|72.915288|28.57|77.33| 23|1144.7731|
| 3| 22.22698| 84.83607|26.86|80.95| 24| 648.0673|
+---+---------+---------+-----+-----+--------+---------+
This should work for you.
First, create nearest_store_df with the minimum distance per id:
import pyspark.sql.functions as psf
nearest_store_df = df\
.groupBy('id')\
.agg(psf.min('dist_km').alias('min_dist_km'))
Now join nearest_store_df back to the original dataframe. Join on both id and the distance (matching on distance alone would produce wrong rows if two different ids happened to share the same dist_km), then drop the helper columns:
df\
    .join(nearest_store_df,
          (df.id == nearest_store_df.id) & (df.dist_km == nearest_store_df.min_dist_km),
          'inner')\
    .drop(nearest_store_df.id)\
    .drop('min_dist_km')\
    .show()