Getting the center point of a cluster for latitude and longitude in Python
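I have a list of coordinates with their area mapping, as follows:
# Imports used by the snippets in this post
import numpy as np
import pandas as pd
import geopandas as gpd
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
from sklearn.cluster import DBSCAN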
df=pd.DataFrame({'user_id':[55,55,356,356,356,356,632,752,938,963,963,1226,2663,2663,2663,2663,2663,3183,3197,3344,3387,3387,3387,3387,3396,3515,3536,3570,3819,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,4584,4594,4713,4931,4931,5026,5487,5487,5575,5575,5575,5602,5639,5639,5639,5639,5783,5783,5783,5783,5783,5801,6373,6718,6886,6886,7055,7055,7608,7608,7777,8186,8186,8307,8712,9271,9896,9991,9991,9991,],
'latitude':[13.2633943,13.2633964,12.809677124,12.8099212646,12.8100585938,12.810065981,12.9440132,12.2958104,12.5265661,13.0767648,13.0853577,12.6301221,12.8558120728,12.8558349609,12.8558654785,12.8558807373,12.8558959961,12.9141417,13.0696411133,13.0708333,10.7904833,10.7904833,10.7904833,12.884091,13.0694428,13.204637,12.6922086,13.0767648,13.3489958,12.8653798,12.8654014,12.8654124,12.8654448,12.8654521,12.8654658,12.8654733,12.8654815,12.8654844,12.8655367,12.8655376,12.865576,12.4025539,13.1986348,12.9548317,11.664325,11.6690603,13.0656551,13.1137554,13.1137978,12.770418,12.9141417,12.9141417,15.3530727,12.8285405054,12.8285925,12.8288406,12.829668,12.2958104,12.5583190918,12.7367172241,12.7399597168,12.7422103882,12.8631981,13.3378762,12.5638375681,13.1961683,13.1993678,12.1210997,12.5265661,13.1332778931,13.13331604,12.1210997,13.0649372,13.0658797,12.6955714,12.8213806152,13.0641708374,13.2013835,13.1154662,13.1957473755,13.2329025269,],
'longitude':[75.4341412,75.4341377,77.6955155017,77.6952344177,77.6952628334,77.6952629697,75.7926285,76.6393805,78.2149575,77.6397007,77.6445166,77.1145378,77.7985897361,77.7985953164,77.798622112,77.7985610742,77.7986275271,74.8559568,77.6520116309,77.6519444,78.7046725,78.7046725,78.7046725,74.8372421,77.6523596,77.6506622,78.6181131,77.6397007,74.7855559,77.7972191,77.7971733,77.7971429,77.7971621,77.7970823,77.7970327,77.7970371,77.7972272,77.7970335,77.7969649,77.796956,77.7971244,75.9811564,77.7065928,77.4739615,78.1460142,78.139311,77.4380296,77.5732437,77.573201,74.8609332,74.8559568,74.8559568,75.1386825,77.6891233027,77.6899376,77.6892531,77.6902955,76.6393805,77.7842363745,77.7841222429,77.7837989946,77.7830295359,77.4336428,77.117325,75.5833357573,77.7053231,77.7095658,78.1582143,78.2149575,77.5728687166,77.5729374436,78.1582143,77.7435873,77.7444939,78.0620963,77.6606095672,77.746332751,77.7082838,77.6069977,77.7055573788,77.6956690934,],
})
I am clustering these latitude/longitude pairs with DBSCAN:
X=np.array(df[['latitude', 'longitude']])
kms_per_radian = 6371.0088
epsilon = 1 / kms_per_radian
db = DBSCAN(eps=epsilon, min_samples=5)
model=db.fit(np.radians(X))
cluster_labels = db.labels_
num_clusters = len(set(cluster_labels))
cluster_labels = cluster_labels.astype(float)
cluster_labels[cluster_labels == -1] = np.nan
clusters = pd.Series( [X[cluster_labels==n] for n in range(num_clusters)] )
labels = pd.DataFrame(db.labels_,columns=['CLUSTER_LABEL'])
dfnew=pd.concat([df,labels],axis=1,sort=False)
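The `epsilon` above turns a 1 km search radius into an angle in radians by dividing by the Earth's mean radius in km. The sketch below works through that arithmetic with the standard library only. `km_to_radians` and `haversine_km` are illustrative helper names, not scikit-learn functions. One caveat worth hedging: an eps in radians is an exact great-circle threshold only when DBSCAN is run with `metric='haversine'`; the snippet above keeps the default Euclidean metric on the radian values, so there it acts as an approximation.

```python
import math

KMS_PER_RADIAN = 6371.0088  # mean Earth radius in km, same constant as above


def km_to_radians(km):
    """Convert a great-circle distance in km to an angle in radians."""
    return km / KMS_PER_RADIAN


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * KMS_PER_RADIAN * math.asin(math.sqrt(a))


epsilon = km_to_radians(1.0)  # same value as `1 / kms_per_radian`

# Two readings of user 2663 from the data: they are only metres apart,
# so they sit well inside each other's 1 km neighbourhood.
d = haversine_km(12.8558120728, 77.7985897361, 12.8558349609, 77.7985953164)
within_eps = km_to_radians(d) <= epsilon
```

Points whose angular separation is at most `epsilon` count as neighbours, and a point with at least `min_samples` neighbours (itself included) becomes a core point of a cluster.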
How can I get the center point of each cluster and map it back to the dataset, so that I can display the results as markers in folium and aggregate from there?
So far I have tried:
def get_centermost_point(cluster):
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)
centermost_points = clusters.map(get_centermost_point)
This gives me an IndexError: list index out of range.
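A plausible cause of the error (an assumption, the source does not confirm it) is that one element of `clusters` is empty, so there is no centroid coordinate to read. Below is a minimal sketch of the same "centermost point" idea with an explicit guard for empty clusters. It uses only the standard library: a hand-written haversine distance stands in for geopy's `great_circle`, and the plain mean of the coordinates stands in for the `MultiPoint` centroid. `haversine_m` and `get_centermost_point_safe` are hypothetical names.

```python
import math


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    radius_m = 6371008.8
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlam = math.radians(q[1] - p[1])
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def get_centermost_point_safe(cluster):
    """Return the member closest to the mean (lat, lon), or None if empty."""
    points = [tuple(p) for p in cluster]
    if not points:
        return None  # nothing to pick from, so skip instead of raising
    centroid = (sum(p[0] for p in points) / len(points),
                sum(p[1] for p in points) / len(points))
    return min(points, key=lambda p: haversine_m(p, centroid))


# Three readings of user 3883 taken from the data above
cluster = [(12.8653798, 77.7972191),
           (12.8654521, 77.7970823),
           (12.865576, 77.7971244)]
centermost = get_centermost_point_safe(cluster)
empty_result = get_centermost_point_safe([])
```

Returning `None` for an empty cluster keeps a `clusters.map(...)` call from failing on one bad element; the `None` entries can be dropped afterwards.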
Get the centroid coordinates of each cluster:
for ea in clusters:
    print(MultiPoint(ea).centroid)
Result:
POINT (12.85585784912 77.79859915316)
POINT (12.86547048333333 77.79709629166666)
POINT (13.1982603551 77.70706457576)
POINT EMPTY
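The `POINT EMPTY` row follows from how `clusters` was built: `num_clusters = len(set(cluster_labels))` also counts the noise label `-1`, giving four distinct labels (-1, 0, 1, 2), so `range(num_clusters)` asks for a cluster 3 that has no points. A minimal sketch of grouping that skips noise and never creates an empty slot is shown here. It uses toy data and a hypothetical helper name, `group_points_by_label`, rather than the question's arrays.

```python
from collections import defaultdict


def group_points_by_label(points, labels, noise_label=-1):
    """Group points by cluster label, leaving out points labelled as noise."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        if label != noise_label:
            groups[label].append(point)
    return dict(groups)


# Toy (lat, lon) points and labels, made up for illustration only
points = [(13.26, 75.43), (12.8558, 77.7986), (12.8559, 77.7986), (12.8654, 77.7972)]
labels = [-1, 0, 0, 1]
clusters_by_label = group_points_by_label(points, labels)
```

Keeping the label as the dictionary key also removes the need to assign labels to centroids by hand later on.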
Create a GeoDataFrame from the centroids and plot it (assuming the coordinates are long/lat).
# To create a geodataframe of the centroids
clusters_centroids = [MultiPoint(ea).centroid for ea in clusters]
crs = "EPSG:4326"  # recent geopandas/pyproj versions deprecate the {'init': 'epsg:4326'} form
cgdf = gpd.GeoDataFrame(clusters, crs=crs, geometry=clusters_centroids)
# Eliminate some empty row(s)
good_cdgf = cgdf[ ~cgdf['geometry'].is_empty ]
# plot to see the centroids
good_cdgf.plot()
Output plot: [figure: the plotted cluster centroids]
Add the center points back to the original dataframe df.
I start by inspecting dfnew, which is just df with the CLUSTER_LABEL column added.
print(dfnew)
user_id latitude longitude CLUSTER_LABEL
0 55 13.263394 75.434141 -1
1 55 13.263396 75.434138 -1
2 356 12.809677 77.695516 -1
3 356 12.809921 77.695234 -1
4 356 12.810059 77.695263 -1
.. ... ... ... ...
76 9271 13.064171 77.746333 -1
77 9896 13.201384 77.708284 2
78 9991 13.115466 77.606998 -1
79 9991 13.195747 77.705557 2
80 9991 13.232903 77.695669 -1
[81 rows x 4 columns]
The CLUSTER_LABEL column will be used to join with the cgdf dataframe and pull values from it.
Add a new CLUSTER_LABEL column with the correct cluster label values to cgdf:
cgdf["CLUSTER_LABEL"] = [0,1,2, -1]
Drop column 0 of cgdf:
cgdf.drop(columns=[0], inplace=True)
Check the current cgdf:
print(cgdf)
geometry CLUSTER_LABEL
0 POINT (12.856 77.799) 0
1 POINT (12.865 77.797) 1
2 POINT (13.198 77.707) 2
3 POINT EMPTY -1
Merge the two dataframes into a new dataframe, dfnew2.
dfnew2 = dfnew.merge(cgdf, on='CLUSTER_LABEL')
Check the current state of dfnew2; it should look like this:
user_id latitude longitude CLUSTER_LABEL geometry
0 55 13.263394 75.434141 -1 POINT EMPTY
1 55 13.263396 75.434138 -1 POINT EMPTY
2 356 12.809677 77.695516 -1 POINT EMPTY
3 356 12.809921 77.695234 -1 POINT EMPTY
4 356 12.810059 77.695263 -1 POINT EMPTY
.. ... ... ... ... ...
76 4594 13.198635 77.706593 2 POINT (13.198 77.707)
77 6886 13.196168 77.705323 2 POINT (13.198 77.707)
78 6886 13.199368 77.709566 2 POINT (13.198 77.707)
79 9896 13.201384 77.708284 2 POINT (13.198 77.707)
80 9991 13.195747 77.705557 2 POINT (13.198 77.707)
[81 rows x 5 columns]
'dfnew2' should be equivalent to the original dataframe with 2 extra columns, 'CLUSTER_LABEL' and 'geometry' (the center point of the cluster).
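For folium markers, plain latitude and longitude columns may be easier to work with than a geometry column. Because the points were built as (latitude, longitude), the `x` of each centroid printed above is actually the latitude and the `y` the longitude. The sketch below shows the same label-based join with the standard library only, with dicts standing in for dataframe rows. `attach_centroids` and the `centroid_lat` / `centroid_lon` keys are hypothetical names, and the three rows are a small subset of the data, so the centroid computed here is not the full cluster centroid.

```python
from collections import defaultdict


def attach_centroids(rows, noise_label=-1):
    """Add the mean lat/lon of each row's cluster; noise rows get None."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # label -> [lat sum, lon sum, count]
    for row in rows:
        label = row["CLUSTER_LABEL"]
        if label != noise_label:
            acc = sums[label]
            acc[0] += row["latitude"]
            acc[1] += row["longitude"]
            acc[2] += 1
    centroids = {label: (lat / n, lon / n) for label, (lat, lon, n) in sums.items()}
    out = []
    for row in rows:
        lat, lon = centroids.get(row["CLUSTER_LABEL"], (None, None))
        out.append({**row, "centroid_lat": lat, "centroid_lon": lon})
    return out


rows = [
    {"user_id": 55, "latitude": 13.2633943, "longitude": 75.4341412, "CLUSTER_LABEL": -1},
    {"user_id": 9896, "latitude": 13.2013835, "longitude": 77.7082838, "CLUSTER_LABEL": 2},
    {"user_id": 9991, "latitude": 13.1957473755, "longitude": 77.7055573788, "CLUSTER_LABEL": 2},
]
with_centroids = attach_centroids(rows)
```

Unlike the merge above, this keeps the rows in their original order, and noise rows carry `None` rather than an empty point.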