Find nested boxes in a huge dataset with geopandas (or other tools)
Basically, I have a DataFrame with a large number of boxes, each defined by an (xmin, ymin, xmax, ymax) tuple.
   xmin  ymin  xmax  ymax
0    66    88   130   151
1   143   390   236   468
2    77   331   143   423
3   289   112   337   157
4   343   282   405   352
.....
My task is to remove all nested boxes (i.e. any box that is within another box must be removed).
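To pin down what "within" means for axis-aligned boxes, here is a minimal sketch of the containment test in plain Python. The helper name is_nested and the inner box (80, 100, 120, 140) are hypothetical, made up for illustration. The sketch assumes that a box sharing an edge with its container still counts as nested; use strict comparisons if that is not what you want.

```python
def is_nested(inner, outer):
    """Return True if box `inner` lies within box `outer` (shared edges allowed)."""
    ixmin, iymin, ixmax, iymax = inner
    oxmin, oymin, oxmax, oymax = outer
    return oxmin <= ixmin and oymin <= iymin and ixmax <= oxmax and iymax <= oymax


# Row 0 of the sample data and a hypothetical box placed inside it.
outer = (66, 88, 130, 151)
print(is_nested((80, 100, 120, 140), outer))   # True
# Row 1 of the sample data is not inside row 0.
print(is_nested((143, 390, 236, 468), outer))  # False
```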
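My current approach:
- Construct a GeoDataFrame with box geometries
- Sort by box size (descending)
- Iteratively look for smaller boxes inside each larger box.
Sandbox: https://www.kaggle.com/code/easzil/remove-nested-bbox/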
import geopandas as gpd
import shapely.geometry

def remove_nested_bbox(df):
    # make an extra unique 'id'
    df['__id'] = range(0, len(df))
    # create geometry
    df['__geometry'] = df.apply(lambda x: shapely.geometry.box(x.xmin, x.ymin, x.xmax, x.ymax), axis=1)
    gdf = gpd.GeoDataFrame(df, geometry='__geometry')
    # sort by area, largest first
    gdf['__area'] = gdf['__geometry'].area
    gdf.sort_values('__area', ascending=False, inplace=True)
    nested_id = set()
    for iloc in range(len(gdf)):
        # skip boxes already identified as nested
        if gdf.iloc[iloc]['__id'] in nested_id:
            continue
        bbox = gdf.iloc[iloc]  # current target (larger) bbox
        tests = gdf.iloc[iloc+1:]  # all bboxes smaller than the current target
        tests = tests[~tests['__id'].isin(nested_id)]  # skip boxes already identified as nested
        nested = tests[tests['__geometry'].within(bbox['__geometry'])]
        nested_id.update(list(nested['__id']))
    df = df[~df['__id'].isin(nested_id)]
    # drop the helper columns; depending on the pandas version, '__area'
    # may exist only on gdf, hence errors='ignore'
    return df.drop(columns=['__id', '__geometry', '__area'], errors='ignore')
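Is there a better way to optimize this task so that it runs faster?
The current approach is slow on large datasets.
I would also consider other approaches, such as an implementation in C or CUDA.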
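- Your sample data is not large and contains no instance of a box within a box, so I have randomly generated some
- Used loc to restrict the check to boxes with larger dimensions
- Not sure whether this is faster than your approach; timing details: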
%timeit gdf["within"] = gdf.apply(within, args=(gdf,), axis=1)
print(f"""number of polygons: {len(gdf)}
number kept: {len(gdf.loc[lambda d: ~d["within"]])}
""")
2.37 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
number of polygons: 2503
number kept: 241
Visualisation
Full code
import pandas as pd
import numpy as np
import geopandas as gpd
import io
import shapely.geometry

df = pd.read_csv(
    io.StringIO(
        """ xmin ymin xmax ymax
0 66 88 130 151
1 143 390 236 468
2 77 331 143 423
3 289 112 337 157
4 343 282 405 352"""
    ),
    sep=r"\s+",
)
# randomly generate some boxes, check they are valid
df = pd.DataFrame(
    np.random.randint(1, 200, [10000, 4]), columns=["xmin", "ymin", "xmax", "ymax"]
).loc[lambda d: (d["xmax"] > d["xmin"]) & (d["ymax"] > d["ymin"])]
gdf = gpd.GeoDataFrame(
    df, geometry=df.apply(lambda r: shapely.geometry.box(*r), axis=1)
)
gdf.plot(edgecolor="black", alpha=0.6)

# somewhat optimised by limiting polygons that are considered by looking at dimensions
def within(r, gdf):
    # candidate containers: every other box that is strictly larger on all four sides
    for g in gdf.loc[
        ~(gdf.index == r.name)
        & gdf["xmin"].lt(r["xmin"])
        & gdf["ymin"].lt(r["ymin"])
        & gdf["xmax"].gt(r["xmax"])
        & gdf["ymax"].gt(r["ymax"]),
        "geometry",
    ]:
        if r["geometry"].within(g):
            return True
    return False

gdf["within"] = gdf.apply(within, args=(gdf,), axis=1)
gdf.loc[lambda d: ~d["within"]].plot(edgecolor="black", alpha=0.6)
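As a possible alternative to the row-wise apply above, the pairwise test can be handed to the spatial index behind geopandas.sjoin. This is only a sketch, not a benchmarked solution: the function name drop_nested_sjoin is hypothetical, and it assumes a geopandas version that accepts the predicate keyword and names the right-hand index column index_right. Note that shapely's within is true for identical geometries, so exact duplicate boxes would all be dropped here, whereas the sorted loop in the question keeps one of them.

```python
import geopandas as gpd
import numpy as np
import pandas as pd
import shapely.geometry


def drop_nested_sjoin(df):
    """Return the rows of df whose box is not within any other box."""
    coords = df[["xmin", "ymin", "xmax", "ymax"]]
    # Positional (0..n-1) index so join results map straight back to rows of df.
    gdf = gpd.GeoDataFrame(
        geometry=[shapely.geometry.box(*row) for row in coords.itertuples(index=False)]
    )
    # Each result row pairs a left box (the index) with a right box
    # ('index_right') that contains it.
    pairs = gpd.sjoin(gdf, gdf, how="inner", predicate="within")
    left = pairs.index.to_numpy()
    right = pairs["index_right"].to_numpy()
    # Every box is within itself, so ignore the self-matches.
    nested = np.zeros(len(df), dtype=bool)
    nested[np.unique(left[left != right])] = True
    return df[~nested]


demo = pd.DataFrame(
    [(0, 0, 10, 10), (2, 2, 5, 5), (20, 20, 30, 30), (8, 8, 12, 12)],
    columns=["xmin", "ymin", "xmax", "ymax"],
)
print(drop_nested_sjoin(demo))
```

In the demo frame only the second box lies inside another one; the last box merely overlaps the first and is kept.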
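Approach 2
- Uses the sample data you provided on Kaggle
- This returns in about half the time of the previous version (5s)
- Similar concept: a box is within another box if its xmin and ymin are greater than the other box's and its xmax and ymax are less than the other box's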
import functools
df = pd.read_csv("https://storage.googleapis.com/kagglesdsdata/datasets/2015126/3336994/sample.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20220322%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20220322T093633Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=3cc7824afe45313fe152858a6b8d79f93b0d90237ad82737fcf28949b9314df4be2f247821934a371d09cff4b463d69fc2422d8d7f746d6fccf014605b2e0f2cba54c23fba012c2531c4cd714436545bd83db0e880072fa049b116106ba4e296c259c32bc19267a15b9b9af78494bb6859cb53ffe4388c3b8c375a330e09008bb1d9c839f8ab4c14a8f01c38179ba31dc9f4ea9fa11f5ecc7e6ba87757edbe48577d60988349b948ceb70e885be5d6ebc36abe438a5275fa683ee4e318e21661ea032af7d8e2f488020288a1a2ff15af8aa153bb8ac33a0b827dd53c928ddf3abb024f2972ba6ef21bc9a0034e504706a2b3fc78be9ea3bb9190437d98a8ab35")
def within_np(df):
    d = {}
    for c in df.columns[0:4]:
        # a[i, j] holds the value of box j, a.T[i, j] the value of box i
        a = np.tile(df[c].values.T, (len(df), 1))
        d[c] = a.T > a if c[1:] == "min" else a.T < a
    # aa[i, j] is True when box i is strictly inside box j
    aa = functools.reduce(np.logical_and, d.values())
    return aa.sum(axis=1) > 0
df.loc[~within_np(df)]
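One caveat with within_np is that it builds several full N x N arrays, so memory grows quadratically with the number of boxes. A minimal sketch of the same strict comparison done in row chunks follows; the name within_np_chunked and the chunk parameter are hypothetical, and the chunk size is an untuned guess. Peak memory is then roughly chunk x N booleans per comparison.

```python
import numpy as np
import pandas as pd


def within_np_chunked(df, chunk=1024):
    """Boolean array: True where a box lies strictly inside some other box."""
    xmin, ymin, xmax, ymax = (
        df[c].to_numpy() for c in ["xmin", "ymin", "xmax", "ymax"]
    )
    nested = np.zeros(len(df), dtype=bool)
    for start in range(0, len(df), chunk):
        s = slice(start, start + chunk)
        # Rows: boxes in this chunk. Columns: all boxes.
        # Strict inequalities mirror within_np, so a box never matches itself.
        inside = (
            (xmin[s, None] > xmin[None, :])
            & (ymin[s, None] > ymin[None, :])
            & (xmax[s, None] < xmax[None, :])
            & (ymax[s, None] < ymax[None, :])
        )
        nested[s] = inside.any(axis=1)
    return nested


demo = pd.DataFrame(
    [(0, 0, 10, 10), (2, 2, 5, 5), (20, 20, 30, 30), (8, 8, 12, 12)],
    columns=["xmin", "ymin", "xmax", "ymax"],
)
print(demo.loc[~within_np_chunked(demo, chunk=2)])
```

Usage is the same as before: df.loc[~within_np_chunked(df)] keeps the boxes that are not nested.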