How to sjoin in geopandas using a common key column as well as location

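Suppose I have a dataframe A consisting of two columns: geometry (points) and hour. Dataframe B also consists of geometry (shapes) and hour.

I am familiar with the standard sjoin. What I want is for sjoin to link rows from the two tables only when the hours are the same. In traditional join terms, the key would be geometry and hour. How can I achieve this?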

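I reviewed two logical approaches:

  • spatial join followed by a filter on hour
  • shard (filter) the dataframes on hour first, spatial join each pair of shards, and concatenate the results from the shards
  • then tested the two results for equality
  • and ran some timings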
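
To make the two approaches concrete before the full benchmark, here is a minimal sketch on toy data. The squares, points, the `name` column, and the function names `join_then_filter` and `shard_then_join` are all made up for illustration; only `gpd.sjoin` with its default inner/intersects behaviour and its `_left`/`_right` column suffixes comes from geopandas. Unlike the benchmark code, this sketch only loops over hours present in both frames, which is an assumption on my part to avoid joining empty shards.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, box

# Hypothetical toy data: two squares and four points, each tagged with an hour.
polys = gpd.GeoDataFrame(
    {"name": ["west", "east"], "hour": [1, 2]},
    geometry=[box(0, 0, 1, 1), box(2, 0, 3, 1)],
)
points = gpd.GeoDataFrame(
    {"hour": [1, 2, 2, 1]},
    geometry=[Point(0.5, 0.5), Point(0.6, 0.6), Point(2.5, 0.5), Point(2.4, 0.4)],
)

def join_then_filter(left, right):
    # Spatial join on everything, then keep only rows whose hours agree.
    joined = gpd.sjoin(left, right)
    return joined[joined["hour_left"] == joined["hour_right"]]

def shard_then_join(left, right):
    # Spatial join one hour at a time, restricted to hours found in both frames.
    # Note: pd.concat raises if there are no shared hours at all.
    hours = sorted(set(left["hour"]) & set(right["hour"]))
    parts = [
        gpd.sjoin(left[left["hour"] == h], right[right["hour"] == h])
        for h in hours
    ]
    return pd.concat(parts)

a = join_then_filter(points, polys)
b = shard_then_join(points, polys)
# Points 0 and 2 sit inside a square with the same hour; points 1 and 3 do not.
print(a[["hour_left", "hour_right", "name"]])
print(sorted(a.index) == sorted(b.index))
```

Both functions keep a point only when it falls inside a polygon and carries the same hour, so they should agree on which rows survive; they differ only in how much work the spatial join has to do.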

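Conclusions

  • The timing difference on this test data set is small.
  • simple() is quicker if the number of points is small; with the 84,000 points used here, shard() was the faster of the two.
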
import pandas as pd
import numpy as np
import geopandas as gpd
import shapely.geometry
import requests

# source some points and polygons
# fmt: off
dfp = pd.read_html("https://www.latlong.net/category/cities-235-15.html")[0]
dfp = gpd.GeoDataFrame(dfp, geometry=dfp.loc[:,["Longitude", "Latitude",]].apply(shapely.geometry.Point, axis=1))
res = requests.get("https://opendata.arcgis.com/datasets/69dc11c7386943b4ad8893c45648b1e1_0.geojson")
df_poly = gpd.GeoDataFrame.from_features(res.json())
# fmt: on
# bulk up number of points
dfp = pd.concat([dfp for _ in range(1000)]).reset_index()
HOURS = 24
dfp["hour"] = np.random.randint(0, HOURS, len(dfp))
df_poly["hour"] = np.random.randint(0, HOURS, len(df_poly))

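def simple():
    # approach 1: spatial join everything, then keep rows where the hours match
    return gpd.sjoin(dfp, df_poly).loc[lambda d: d["hour_left"] == d["hour_right"]]

def shard():
    # approach 2: spatial join each hour's points and polygons separately, then concatenate
    return pd.concat(
        [
            gpd.sjoin(*[d.loc[d["hour"].eq(h)] for d in [dfp, df_poly]])
            for h in range(HOURS)
        ]
    )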

print(f"""length test: {len(simple()) == len(shard())} {len(simple())}
dataframe test: {simple().sort_index().equals(shard().sort_index())}
points: {len(dfp)}
polygons: {len(df_poly)}""")

# timings (%timeit is an IPython/Jupyter magic, so run this in a notebook or IPython shell)
%timeit simple()
%timeit shard()

Output

length test: True 3480
dataframe test: True
points: 84000
polygons: 379
6.48 s ± 311 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.05 s ± 34.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
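
The timings above rely on the `%timeit` magic, which only exists inside IPython or Jupyter. If you run this as a plain script, something along these lines with the standard-library `timeit` module should give comparable numbers. The helper `best_time` and the stand-in functions `candidate_a` and `candidate_b` are hypothetical; you would pass the real `simple` and `shard` functions instead.

```python
import timeit

def best_time(fn, repeat=3, number=1):
    """Best wall-clock seconds per call over `repeat` rounds of `number` calls."""
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number

# Hypothetical stand-ins; substitute the real simple() and shard() functions.
def candidate_a():
    return sorted(range(50_000), reverse=True)

def candidate_b():
    return list(range(49_999, -1, -1))

for fn in (candidate_a, candidate_b):
    print(f"{fn.__name__}: {best_time(fn):.6f} s")
```

Taking the minimum over several rounds is a common convention because it is least affected by other activity on the machine; `%timeit` reports a mean and standard deviation instead, so the figures will not match exactly.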