How to optimize the scan of one huge file/table in Hive to check whether a lat/long point is contained in a WKT geometry shape

Question

I am currently trying to associate each lat/long ping from a device with its ZIP code.

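I have denormalized the lat/long device ping data and created a cross-product (Cartesian product) join table in which each row has ST_Point(long,lat), geometry_shape_of_ZIP, and the ZIP code associated with that geometry. For testing purposes the table has around 45 million rows, and in production it will grow to about 1 billion rows per day.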

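Even though the data is flattened and there is no join condition, the query takes about 2 hours to complete. Is there a faster way to compute spatial queries? Or how can I optimize the following query?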

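Inline below are some of the optimization steps I have already performed. With these optimizations, all other operations complete in at most 5 minutes; only this one step is slow. I am using an AWS cluster with 2 master nodes and 5 data nodes.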

set hive.vectorized.execution.enabled = true;
set hive.execution.engine=tez;
set hive.enforce.sorting=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

analyze table tele_us_zipmatch compute statistics for columns;

CREATE TABLE zipcheck (
  `long4` double,
  `lat4` double,
  state_name string,
  country_code string,
  country_name string,
  region string,
  zip int,
  countyname string)
PARTITIONED by (state_id string)
STORED AS ORC TBLPROPERTIES ("orc.compress" = "SNAPPY",
  'orc.create.index'='true',
  'orc.bloom.filter.columns'='');

INSERT OVERWRITE TABLE zipcheck PARTITION(state_id)
select long4, lat4, state_name, country_code, country_name, region, zip, countyname, state_id
from tele_us_zipmatch
where ST_Contains(wkt_shape,zip_point)=TRUE;

ST_Contains is a function from Esri (reference: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation#relationship-tests).

Any help is greatly appreciated.

Thanks.

Answer 1

If the ZIP code dataset can fit in memory, try a custom Map-Reduce application, built by adapting the sample in GIS-Tools-for-Hadoop.

[collaborator]
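
To illustrate the in-memory idea, here is a minimal Python sketch, not the GIS-Tools-for-Hadoop sample itself. It assumes the ZIP polygons fit in memory as simple rings without holes, buckets them into a coarse grid by bounding box, and runs an exact point-in-polygon test only against the few candidates in a ping's grid cell. The names `CELL_DEG`, `build_index`, `lookup_zip`, and the toy polygons are hypothetical, and a plain ray-casting test stands in for Esri's ST_Contains.

```python
from collections import defaultdict
from math import floor

CELL_DEG = 0.5  # grid cell size in degrees (assumed tuning value)


def point_in_ring(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def build_index(zip_polygons):
    """Map each grid cell to the ZIP polygons whose bounding box overlaps it."""
    index = defaultdict(list)
    for zip_code, ring in zip_polygons.items():
        xs = [p[0] for p in ring]
        ys = [p[1] for p in ring]
        for cx in range(floor(min(xs) / CELL_DEG), floor(max(xs) / CELL_DEG) + 1):
            for cy in range(floor(min(ys) / CELL_DEG), floor(max(ys) / CELL_DEG) + 1):
                index[(cx, cy)].append((zip_code, ring))
    return index


def lookup_zip(index, lon, lat):
    """Return the ZIP code whose polygon contains the point, or None."""
    cell = (floor(lon / CELL_DEG), floor(lat / CELL_DEG))
    for zip_code, ring in index.get(cell, []):
        if point_in_ring(lon, lat, ring):
            return zip_code
    return None


# Toy data: two made-up square "ZIP" polygons, not real boundaries.
ZIP_POLYGONS = {
    "00001": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "00002": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}
INDEX = build_index(ZIP_POLYGONS)
```

In a Map-Reduce setting, each task would build the index once and then call `lookup_zip` per ping, so the work scales with the number of pings rather than with pings times ZIP polygons, which is what the materialized cross-product table forces.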


hadoop

hive

spatial

geospatial

hiveql