Memory used and execution order in sub-query

Question

I am using CSV-format data from https://dev.maxmind.com/geoip/geoip2/geolite2/. It maps IP blocks to ASNs and to countries.

I have two tables, both using the Memory engine. The first has 299727 records and the second has 406685.

SELECT *
FROM __ip_block_to_country 
LIMIT 5

┌─network────┬───────id─┬───min_ip─┬───max_ip─┬─geoname_id─┬─country_iso_code─┬─country_name─┐
│ 1.0.0.0/24 │ 16777216 │ 16777217 │ 16777472 │    2077456 │ AU               │ Australia    │
│ 1.0.1.0/24 │ 16777472 │ 16777473 │ 16777728 │    1814991 │ CN               │ China        │
│ 1.0.2.0/23 │ 16777728 │ 16777729 │ 16778240 │    1814991 │ CN               │ China        │
│ 1.0.4.0/22 │ 16778240 │ 16778241 │ 16779264 │    2077456 │ AU               │ Australia    │
│ 1.0.8.0/21 │ 16779264 │ 16779265 │ 16781312 │    1814991 │ CN               │ China        │
└────────────┴──────────┴──────────┴──────────┴────────────┴──────────────────┴──────────────┘

SELECT *
FROM __ip_block_to_asn 
LIMIT 5

┌─network──────┬─autonomous_system_number─┬─autonomous_system_organization─┬───────id─┬─subnet_count─┬───min_ip─┬───max_ip─┐
│ 1.0.0.0/24   │                    13335 │ Cloudflare Inc                 │ 16777216 │          255 │ 16777217 │ 16777472 │
│ 1.0.4.0/22   │                    56203 │ Gtelecom-AUSTRALIA             │ 16778240 │         1023 │ 16778241 │ 16779264 │
│ 1.0.16.0/24  │                     2519 │ ARTERIA Networks Corporation   │ 16781312 │          255 │ 16781313 │ 16781568 │
│ 1.0.64.0/18  │                    18144 │ Energia Communications,Inc.    │ 16793600 │        16383 │ 16793601 │ 16809984 │
│ 1.0.128.0/17 │                    23969 │ TOT Public Company Limited     │ 16809984 │        32767 │ 16809985 │ 16842752 │
└──────────────┴──────────────────────────┴────────────────────────────────┴──────────┴──────────────┴──────────┴──────────┘

Now I want to check which country covers the entire IP pool of an ASN. The query below only fetches the indexes of the countries that satisfy this condition.

SELECT idx from(
SELECT 
    (
        SELECT groupArray(min_ip),groupArray(max_ip),groupArray(country_iso_code),groupArray(country_name)
        FROM __ip_block_to_country
    ) t,
    arrayFilter((i,mii, mai) -> min_ip >= mii and max_ip <= mai, arrayEnumerate(t.1), t.1, t.2) as idx
FROM __ip_block_to_asn
);
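To make the intent of the arrayFilter lambda concrete, here is a minimal Python sketch of the same containment test. It is not how ClickHouse evaluates the query. The data are the sample rows printed above, and the helper name containing_country_indexes is hypothetical.

```python
# Sample rows copied from the __ip_block_to_country output above:
# (min_ip, max_ip, country_iso_code)
country_blocks = [
    (16777217, 16777472, "AU"),
    (16777473, 16777728, "CN"),
    (16777729, 16778240, "CN"),
    (16778241, 16779264, "AU"),
    (16779265, 16781312, "CN"),
]

# (min_ip, max_ip) of the first three rows of the __ip_block_to_asn output above.
asn_blocks = [
    (16777217, 16777472),
    (16778241, 16779264),
    (16781313, 16781568),
]


def containing_country_indexes(min_ip, max_ip, countries):
    """Return 1-based indexes of country blocks that fully contain [min_ip, max_ip].

    Hypothetical helper mirroring the lambda
    (i, mii, mai) -> min_ip >= mii and max_ip <= mai.
    """
    return [
        i
        for i, (mii, mai, _code) in enumerate(countries, start=1)
        if min_ip >= mii and max_ip <= mai
    ]


idx_per_asn = [
    containing_country_indexes(lo, hi, country_blocks) for lo, hi in asn_blocks
]
print(idx_per_asn)  # prints [[1], [4], []]
```

The third ASN block gets an empty list only because the five sample country rows do not reach its range. Against the full country table the result would presumably differ.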

I get the following exception: Received exception from server (version 1.1.54394): Code: 241. DB::Exception: Received from localhost:9000, ::1. DB::Exception: Memory limit (for query) exceeded: would use 512.02 GiB (attempt to allocate chunk of 549755813888 bytes), maximum: 37.25 GiB.

My question is:

It looks as if the statement SELECT groupArray(min_ip),groupArray(max_ip),groupArray(country_iso_code),groupArray(country_name) is executed once for every record of __ip_block_to_asn, and that this is why the query needs so much memory. Is that what happens in my query?

Answer 1

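A scalar subquery is executed only once.

But to execute arrayFilter, the arrays are replicated once for every row of the block being processed from the __ip_block_to_asn table. The effect is similar to a cross join of the two tables.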
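A rough back-of-the-envelope estimate shows why this replication is so expensive. The sketch below is not a model of ClickHouse internals. It assumes 8-byte array elements, that the 299727-row table is __ip_block_to_country, and that the whole 406685-row __ip_block_to_asn table arrives as a single block. The function name replicated_array_bytes is hypothetical.

```python
COUNTRY_ROWS = 299_727  # rows in __ip_block_to_country (figure from the question)
ASN_ROWS = 406_685      # rows in __ip_block_to_asn (figure from the question)
BYTES_PER_ELEMENT = 8   # assumption: min_ip / max_ip are stored as 8-byte integers
GIB = 2 ** 30
MIB = 2 ** 20


def replicated_array_bytes(block_rows, array_len=COUNTRY_ROWS,
                           elem_bytes=BYTES_PER_ELEMENT):
    """Bytes needed to hold one array repeated once per row of a block."""
    return block_rows * array_len * elem_bytes


# Assumption: the Memory table is read back as one block holding every row.
whole_table = replicated_array_bytes(ASN_ROWS)
# A very small block, for comparison.
ten_rows = replicated_array_bytes(10)

print(f"one array, whole table in one block: {whole_table / GIB:.1f} GiB")
print(f"one array, block of 10 rows:         {ten_rows / MIB:.1f} MiB")
print(f"allocation reported in the error:    {549_755_813_888 / GIB:.0f} GiB")
```

Under these assumptions a single replicated array already needs hundreds of GiB, the same order of magnitude as the failed 512 GiB allocation, while a block of 10 rows needs only a few tens of MiB per array.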

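To work around this, you can use a smaller block size for the SELECT from __ip_block_to_asn. The block size is controlled by the max_block_size setting. For Memory tables, however, blocks always keep the size they had when they were inserted into the table, regardless of the max_block_size setting during SELECT. To allow a flexible block size, you can reload the table into the TinyLog engine.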

CREATE TABLE __ip_block_to_asn2 ENGINE = TinyLog AS SELECT * FROM __ip_block_to_asn

Then run:

SET max_block_size = 10;

SELECT idx from(
SELECT 
(
    SELECT groupArray(min_ip),groupArray(max_ip),groupArray(country_iso_code),groupArray(country_name)
    FROM __ip_block_to_country
) t,
arrayFilter((i,mii, mai) -> min_ip >= mii and max_ip <= mai, arrayEnumerate(t.1), t.1, t.2) as idx
FROM __ip_block_to_asn2
);


clickhouse