Calculate the geographical distance in a PySpark dataframe
My dataframe:
DF = spark.createDataFrame([[114.038696, 22.5315, 114.047302, 22.531799], [ 114.027901, 22.5228, 114.026299, 22.5238], [ 114.026299, 22.5238,114.024597,22.5271], [114.024597, 22.5271,114.024696,22.527201]], list('ABCD'))
DF.show()
+----------+-------+----------+---------+
|         A|      B|         C|        D|
+----------+-------+----------+---------+
|114.038696|22.5315|114.047302|22.531799|
|114.027901|22.5228|114.026299|  22.5238|
|114.026299|22.5238|114.024597|  22.5271|
|114.024597|22.5271|114.024696|22.527201|
+----------+-------+----------+---------+
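(A, B) and (C, D) are the coordinates of two points; columns A and C are longitude; columns B and D are latitude.
I want to calculate the geographical distance between the two points.
I tried: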
from geopy.distance import geodesic
DF = DF.withColumn('Lengths/m', geodesic((['B'],['A']), (['D'],['C'])).m)
Then I got the error:
TypeError: float() argument must be a string or a number, not 'list'
What should I do to calculate the geographical distance successfully?
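You need to define a user-defined function (UDF), because geodesic is a plain Python function that expects numeric coordinates, not Spark columns or lists of column names:
from geopy.distance import geodesic
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

# geopy expects (latitude, longitude), so each point is passed as (B, A) and (D, C)
@F.udf(returnType=FloatType())
def geodesic_udf(a, b):
    # a and b arrive as Python lists [latitude, longitude]
    return geodesic(a, b).m

DF = DF.withColumn('Lengths/m', geodesic_udf(F.array("B", "A"), F.array("D", "C")))
DF.show()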
#+----------+-------+----------+---------+---------+
#|A         |B      |C         |D        |Lengths/m|
#+----------+-------+----------+---------+---------+
#|114.038696|22.5315|114.047302|22.531799|885.94244|
#|114.027901|22.5228|114.026299|22.5238  |198.55937|
#|114.026299|22.5238|114.024597|22.5271  |405.21692|
#|114.024597|22.5271|114.024696|22.527201|15.126849|
#+----------+-------+----------+---------+---------+
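As a rough cross-check of the values above, the distance can also be approximated with the haversine formula in plain Python. This is only a sketch and not what geopy does internally: geodesic works on an ellipsoid, while this version assumes a spherical Earth with a mean radius of 6371008.8 m, so results may differ by a fraction of a percent. The name haversine_m and the radius constant are assumptions introduced here for illustration.

```python
import math

# Assumed mean Earth radius in metres (spherical approximation).
EARTH_RADIUS_M = 6371008.8


def haversine_m(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in metres between two points.

    Arguments are in decimal degrees, latitude first, matching the
    (latitude, longitude) order that geopy expects.
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# First row of the dataframe: point 1 is (B, A), point 2 is (D, C).
d = haversine_m(22.5315, 114.038696, 22.531799, 114.047302)
print(round(d, 1))
```

The result should land within a few metres of the geodesic value in the first row. The same function could be wrapped with F.udf in the same way as geodesic_udf if geopy is not available on the workers.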