Calculate statistics on subset of a dataframe based on values in dataframe (latitude and longitude)
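I want to calculate summary statistics on a subset of a dataframe, where the subset is defined relative to values in each row.
For example, I have a dataframe with latitudes, longitudes, and a number of people.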
import pandas as pd

df = pd.DataFrame({'latitude': [40.991919 , 40.992001 , 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1,2,3,4,5]})
I want to know the total number of people within 0.05 miles of each row. This is easy to do in a loop, but it becomes unusable as the search space grows.
Current/Sample:
from geopy.distance import distance

def distance_calc (row, focus_lat, focus_long):
    start = (row['latitude'], row['longitude'])
    stop = (focus_lat, focus_long)
    return distance(start, stop).miles

df['total_people_within_05'] = 0
df['total_rows_within_05'] = 0

for index, row in df.iterrows():
    focus_lat = df['latitude'][index]
    focus_long = df['longitude'][index]
    new_df = df.copy()
    new_df['distance'] = new_df.apply (lambda row: (distance_calc(row, focus_lat, focus_long)),axis=1)
    df.at[index, 'total_people_within_05'] = new_df.loc[new_df.distance<=.05]['people'].sum()
    df.at[index, 'total_rows_within_05'] = new_df.loc[new_df.distance<=.05].shape[0]
Is there a pythonic way to do this?
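- Take the Cartesian product of the dataframe with itself to get all combinations. This is expensive on larger datasets: it generates N^2 rows, so 25 rows in this example
- Calculate the distance for each combination
- Filter to the required distance with query()
- Use groupby() to get the total number of people. Also generate a list of the indexes to help with transparency
- Finally, join() it all back together and you have what you want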
import geopy.distance as gd
import pandas as pd

df = pd.DataFrame({'latitude': [40.991919 , 40.992001 , 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1,2,3,4,5]})

# Cartesian product via a constant key, then distance, filter, aggregate, join back
df = df.join((df.reset_index().assign(foo=1).merge(df.reset_index().assign(foo=1), on="foo")
              .assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_x,r.longitude_x),
                                                                           (r.latitude_y,r.longitude_y)).miles, axis=1))
              .query("distance<=0.05")
              .rename(columns={"people_y":"nearby"})
              .groupby("index_x").agg({"nearby":"sum","index_y":lambda x: list(x)})
             ))

print(df.to_markdown())
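| | latitude | longitude | people | nearby | index_y |
|---|---|---|---|---|---|
| 0 | 40.9919 | -106.049 | 1 | 6 | [0, 1, 2] |
| 1 | 40.992 | -106.049 | 2 | 6 | [0, 1, 2] |
| 2 | 40.9916 | -106.049 | 3 | 6 | [0, 1, 2] |
| 3 | 40.9899 | -106.05 | 4 | 4 | [3] |
| 4 | 40.9878 | -106.049 | 5 | 5 | [4] |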
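The per-row gd.distance call inside apply is likely the slowest part of the chain above. A minimal sketch of a vectorized alternative follows, assuming a great-circle (haversine) distance on a spherical Earth is an acceptable stand-in for geopy's geodesic distance. This is a swapped-in technique, not what the code above does; the function name nearby_totals and the radius constant are my own choices. It still builds an N x N matrix, so memory grows quadratically.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # assumption: mean radius of a spherical Earth

def nearby_totals(df, radius_miles=0.05):
    """Sum `people` and count rows within radius_miles of each row (self included)."""
    lat = np.radians(df["latitude"].to_numpy())
    lon = np.radians(df["longitude"].to_numpy())
    # Broadcasting turns the N-vectors into N x N matrices of pairwise differences
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against rounding slightly outside [0, 1] before the square root
    dist = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    within = (dist <= radius_miles).astype(int)  # N x N mask as 0/1
    return df.assign(
        total_people_within_05=within @ df["people"].to_numpy(),
        total_rows_within_05=within.sum(axis=1),
    )

df = pd.DataFrame({'latitude': [40.991919, 40.992001, 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1, 2, 3, 4, 5]})
result = nearby_totals(df)
```

On the sample data this should reproduce the nearby totals in the table above (6, 6, 6, 4, 5), since none of the pairwise distances sits close to the 0.05 mile cutoff. For points near the cutoff, haversine and geodesic results can differ.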
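Update - using combinations instead of the Cartesian product
What had been bugging me is that the Cartesian product is a huge overhead, when all that is needed is to calculate the distances between valid combinations
- Use itertools.combinations() to make a list of valid index combinations
- Calculate the distances between this minimal set
- Filter down to the distances we are interested in
- Now build permutations of this smaller set to give a simple join to the actual data
- Join and aggregate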
import itertools

import geopy.distance as gd
import pandas as pd

# df is the original three-column dataframe (latitude, longitude, people) defined above

# get distances between all valid combinations
dfd = (pd.DataFrame(list(itertools.combinations(df.index, 2)))
       .merge(df, left_on=0, right_index=True)
       .merge(df, left_on=1, right_index=True, suffixes=("_0","_1"))
       .assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_0,r.longitude_0),
                                                                    (r.latitude_1,r.longitude_1)).miles, axis=1))
       .loc[:,[0,1,"distance"]]
       # filter down to close proximities
       .query("distance <= 0.05")
      )

# build all valid permutations of close by combinations
dfnppl = (pd.DataFrame(itertools.permutations(pd.concat([dfd[0],dfd[1]]).unique(), 2))
          .merge(df.loc[:,"people"], left_on=1, right_index=True)
         )

# bring it all together
df = (df.reset_index().rename(columns={"index":0}).merge(dfnppl, on=0, suffixes=("","_near"), how="left")
      .groupby(0).agg({**{c:"first" for c in df.columns}, **{"people_near":"sum"}})
     )
| 0 | latitude | longitude | people | people_near |
|---|---|---|---|---|
| 0 | 40.9919 | -106.049 | 1 | 5 |
| 1 | 40.992 | -106.049 | 2 | 4 |
| 2 | 40.9916 | -106.049 | 3 | 3 |
| 3 | 40.9899 | -106.05 | 4 | 0 |
| 4 | 40.9878 | -106.049 | 5 | 0 |
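One caveat with the permutation step, as far as I can tell: it pairs every close-by index with every other close-by index. That works on the sample because rows 0, 1, and 2 are all mutually close, but on other data it could link rows that belong to unrelated clusters. A minimal sketch of an alternative that mirrors each filtered pair instead; the column names a and b, the people values, and the pair list are hypothetical, chosen to show two separate clusters:

```python
import pandas as pd

# Hypothetical data: the index is the row label, the values are people counts
people = pd.Series([1, 2, 3, 4, 5, 6], name="people")

# Hypothetical close pairs, each unordered pair listed once:
# rows 0, 1, 2 form one cluster; rows 3 and 4 form another; row 5 is isolated
pairs = pd.DataFrame({"a": [0, 0, 1, 3], "b": [1, 2, 2, 4]})

# Mirror every pair so each endpoint sees the other, without cross-linking clusters
both = pd.concat([pairs, pairs.rename(columns={"a": "b", "b": "a"})],
                 ignore_index=True)

people_near = (
    both.assign(people_near=both["b"].map(people))   # people count of the neighbour
        .groupby("a")["people_near"].sum()           # total per focus row
        .reindex(people.index, fill_value=0)         # rows with no neighbours get 0
)
```

Like the update above, this excludes each row's own people count. With these made-up pairs, rows 3 and 4 only see each other, whereas permutations over the pooled index set would also pair them with rows 0, 1, and 2.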