Why are simultaneously executing tasks failing and causing Ignite to freeze?

Question

OS: Ubuntu 18.04
Apache Ignite: 2.8.1

Other info: Flask application (serves the API endpoints); pyignite (thin client for Apache Ignite)

Data loaded:
5 tables:
A: 10 million records
B: 37.5 million records
C: 10 million records
D: 25.3 million records
E: 5.5 million records

Ignite persistence takes up 29 GB of space in total.

I am executing two simple SQL queries:
Query 1:

    SELECT SUM(TABLE_B.ID) AS num_people, TABLE_B.PRODUCT, TABLE_C.CITY
    FROM TABLE_B
    JOIN TABLE_C
    ON TABLE_B.ID = TABLE_C.ID
    GROUP BY TABLE_B.PRODUCT, TABLE_C.CITY
    ORDER BY num_people DESC
    LIMIT 1000

Query 2:

    SELECT SUM(TABLE_A.ID) AS num_people, TABLE_C.CITY
    FROM TABLE_A
    JOIN TABLE_C
    ON TABLE_A.ID = TABLE_C.ID
    GROUP BY TABLE_C.CITY
    ORDER BY num_people DESC
    LIMIT 1000

I am running Apache Ignite in a Docker container.

Here is the configuration:

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) ...
-->
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd">
    <!--
        Alter configuration below as needed.
    -->
    <bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <!-- Enabling Apache Ignite Persistent Store. -->
        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <!-- Setting the size of the default region to 48GB. -->
                        <property name="maxSize" value="#{48L * 1024 * 1024 * 1024}"/>
                        <!-- Set the page size to 4 KB -->
                        <property name="pageSize" value="#{4 * 1024}"/>
                        <!-- Enable Native Persistence - required for Authentication -->
                        <property name="persistenceEnabled" value="true"/>
                    </bean>
                </property>
            </bean>
        </property>

        <!-- Enabling authentication. -->
        <property name="authenticationEnabled" value="true"/>

        <!-- Enabling node discovery for Ignite Visor. -->
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>ignite:47500</value>
                                <value>ignite:47501</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
        <property name="communicationSpi">
            <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
                <property name="sharedMemoryPort" value="-1"/>
            </bean>
        </property>

        <!-- Configure internal thread pool. -->
        <!-- <property name="publicThreadPoolSize" value="32"/>-->

        <!-- Configure system thread pool. -->
        <!--        <property name="systemThreadPoolSize" value="32"/>-->

        <!-- Configure query thread pool. -->
        <!-- <property name="queryThreadPoolSize" value="32"/>-->

        <property name="cacheConfiguration">
            <list>
                <bean abstract="true" class="org.apache.ignite.configuration.CacheConfiguration"
                      id="cache-template-bean">
                    <!-- when you create a template via XML configuration, you must add an asterisk to
                    the name of the template -->
                    <property name="name" value="tbl_pll*"/>
                    <property name="cacheMode" value="PARTITIONED"/>
                    <property name="backups" value="1"/>
                    <property name="queryParallelism" value="32"/>
                    <!-- Other cache parameters -->
                </bean>
                <bean abstract="true" class="org.apache.ignite.configuration.CacheConfiguration"
                      id="cache-template-bean">
                    <!-- when you create a template via XML configuration, you must add an asterisk to
                    the name of the template -->
                    <property name="name" value="tbl_hf_pll*"/>
                    <property name="cacheMode" value="PARTITIONED"/>
                    <property name="backups" value="1"/>
                    <property name="queryParallelism" value="16"/>
                    <!-- Other cache parameters -->
                </bean>
            </list>
        </property>

        <property name="failoverSpi">
            <bean class="org.apache.ignite.spi.failover.always.AlwaysFailoverSpi">
                <property name="maximumFailoverAttempts" value="2"/>
            </bean>
        </property>

        <!-- Execute one job at a time. -->
        <property name="collisionSpi">
            <bean class="org.apache.ignite.spi.collision.fifoqueue.FifoQueueCollisionSpi">

                <property name="parallelJobsNumber" value="1"/>
            </bean>
        </property>

    </bean>
</beans>

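When I run the two SQL queries simultaneously, both of them freeze, and the second query always throws an error like the one below. Sometimes it is Unknown type code: b' ', and this time it was: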

app_1  | ERROR:app:Unknown type code: `b'm'`
app_1  | DEBUG:app:Traceback (most recent call last):
app_1  |   File "/usr/local/lib/python3.6/site-packages/pyignite/datatypes/internal.py", line 275, in parse
app_1  |     data_class = tc_map(type_code)
app_1  |   File "/usr/local/lib/python3.6/site-packages/pyignite/datatypes/internal.py", line 108, in tc_map
app_1  |     return _memo_map[key]
app_1  | KeyError: b'm'
app_1  | 
app_1  | During handling of the above exception, another exception occurred:
app_1  | 
app_1  | Traceback (most recent call last):
app_1  |   File "/app/analytics/views.py", line 832, in post
app_1  |     ignore_data_limit=ignore_data_limit
app_1  |   File "/app/analytics/apply_rules.py", line 769, in create_view_ignite
app_1  |     dataset_source_details=dataset_source_details)
app_1  |   File "/app/analytics/apply_rules.py", line 685, in apply_rules_ignite
app_1  |     explain = client.sql('EXPLAIN ' + QUERY)
app_1  |   File "/usr/local/lib/python3.6/site-packages/pyignite/client.py", line 401, in sql
app_1  |     max_rows, timeout,
app_1  |   File "/usr/local/lib/python3.6/site-packages/pyignite/api/sql.py", line 379, in sql_fields
app_1  |     response_class, recv_buffer = response_struct.parse(connection)
app_1  |   File "/usr/local/lib/python3.6/site-packages/pyignite/queries/__init__.py", line 146, in parse
app_1  |     field_class, field_buffer = AnyDataObject.parse(client)
app_1  |   File "/usr/local/lib/python3.6/site-packages/pyignite/datatypes/internal.py", line 277, in parse
app_1  |     raise ParseError('Unknown type code: `{}`'.format(type_code))
app_1  | pyignite.exceptions.ParseError: Unknown type code: `b'm'`

Ignite logs:

ignite_1              | [10:29:31,529][WARNING][query-#176][IgniteH2Indexing] Long running query is finished [time=6230ms, type=MAP, distributedJoin=false, enforceJoinOrder=true, lazy=false, schema=PUBLIC, node=TcpDiscoveryNode [id=a3f4gthv, consistentId=a4deb8, addrs=ArrayList [127.0.0.1, 172.19.0.5], sockAddrs=HashSet [/127.0.0.1:47500, 97ba4d812948/172.19.0.5:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1603362363483, loc=true, ver=2.8.1#20200521-sha1:86422096, isClient=false], reqId=6, segment=31, sql='SELECT
ignite_1              | __Z0.ID __C0_0,
ignite_1              | __Z1.CITY __C0_1
ignite_1              | FROM PUBLIC.TABLE_C __Z1 
ignite_1              |  INNER JOIN PUBLIC.TABLE_A __Z0 
ignite_1              |  ON TRUE
ignite_1              | WHERE __Z0.ID = __Z1.ID', plan=SELECT
ignite_1              |     __Z0.ID AS __C0_0,
ignite_1              |     __Z1.CITY AS __C0_1
ignite_1              | FROM PUBLIC.TABLE_C __Z1
ignite_1              |     /* PUBLIC."IDX_TABLE_C_proxy" */
ignite_1              | INNER JOIN PUBLIC.TABLE_A __Z0
ignite_1              |     /* PUBLIC."_key_PK_proxy": ID = __Z1.ID */
ignite_1              |     ON 1=1
ignite_1              | WHERE __Z0.ID = __Z1.ID]
ignite_1              | [10:29:34,890][WARNING][long-qry-#89][LongRunningQueryManager] Query execution is too long [time=3363ms, type=REDUCE, distributedJoin=false, enforceJoinOrder=false, lazy=false, schema=PUBLIC, reqId=6, sql='SELECT COUNT(DISTINCT `TABLE_A`.`id`) `num_people`,`TABLE_C`.`city` FROM `TABLE_A` JOIN `TABLE_C` ON `TABLE_A`.`id`=`TABLE_C`.`id` GROUP BY `TABLE_C`.`city` ORDER BY `num_people` DESC LIMIT 10000]
ignite_1              | [10:29:35,923][WARNING][client-connector-#211][IgniteH2Indexing] Long running query is finished [time=4400ms, type=REDUCE, distributedJoin=false, enforceJoinOrder=false, lazy=false, schema=PUBLIC, reqId=6, sql='SELECT COUNT(DISTINCT `TABLE_A`.`id`) `num_people`,`TABLE_C`.`city` FROM `TABLE_A` JOIN `TABLE_C` ON `TABLE_A`.`id`=`TABLE_C`.`id` GROUP BY `TABLE_C`.`city` ORDER BY `num_people` DESC LIMIT 10000]
ignite_1              | [10:30:03,730][INFO][grid-timeout-worker-#71][IgniteKernal] 
ignite_1              | Metrics for local node (to disable set 'metricsLogFrequency' to 0)
ignite_1              |     ^-- Node [id=a3f4gthv, uptime=00:04:00.021]
ignite_1              |     ^-- H/N/C [hosts=1, nodes=1, CPUs=32]
ignite_1              |     ^-- CPU [cur=0.03%, avg=7.18%, GC=0%]
ignite_1              |     ^-- PageMemory [pages=3141162]
ignite_1              |     ^-- Heap [used=5288MB, free=67.72%, comm=16384MB]
ignite_1              |     ^-- Off-heap [used=12413MB, free=74.9%, comm=49352MB]
ignite_1              |     ^--   sysMemPlc region [used=0MB, free=99.99%, comm=100MB]
ignite_1              |     ^--   default region [used=12413MB, free=74.74%, comm=49152MB]
ignite_1              |     ^--   metastoreMemPlc region [used=0MB, free=99.92%, comm=0MB]
ignite_1              |     ^--   TxLog region [used=0MB, free=100%, comm=100MB]
ignite_1              |     ^-- Ignite persistence [used=29178MB]
ignite_1              |     ^--   sysMemPlc region [used=0MB]
ignite_1              |     ^--   default region [used=29177MB]
ignite_1              |     ^--   metastoreMemPlc region [used=0MB]
ignite_1              |     ^--   TxLog region [used=0MB]
ignite_1              |     ^-- Outbound messages queue [size=0]
ignite_1              |     ^-- Public thread pool [active=0, idle=0, qSize=0]
ignite_1              |     ^-- System thread pool [active=0, idle=5, qSize=0]
ignite_1              | [10:31:03,733][INFO][grid-timeout-worker-#71][IgniteKernal]

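After this, the Ignite metrics keep being printed, but the node no longer returns any output or accepts new tasks. No JVM pauses are logged.

What am I doing wrong, and why does Ignite freeze and stop responding when the queries run simultaneously?

If the queries are run one at a time, they execute correctly.

PS: I have read a lot of the Ignite documentation and troubleshooting threads on the Nabble forum. Any help would be appreciated.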

Answer 1

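You are most likely running out of heap. With that query, particularly if you don't have the right indexes, you will use a lot of heap, and when you run out, Java will pretty much freeze while it tries to garbage-collect in a loop.

What indexes do you have? Also, if you select sum(column1), what does "order by column1" mean to you?

Can you show the logs from the Ignite node? I would look in particular for "JVM pause" messages.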
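
To check a node log for those messages, something like the following could be used. It is only a sketch: it assumes the node reports pauses with the wording "Possible too long JVM pause: N milliseconds", which is how Ignite's pause detector has worded it as far as I know, so adjust the pattern if your version prints something different. The function name find_jvm_pauses and its threshold_ms parameter are made up for this example.

```python
import re

# Assumed wording of Ignite's JVM pause warning; adjust if your logs differ.
PAUSE_RE = re.compile(r"Possible too long JVM pause: (\d+) milliseconds")


def find_jvm_pauses(lines, threshold_ms=0):
    """Return the pause durations (in ms) found in an iterable of log lines."""
    pauses = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            duration = int(match.group(1))
            if duration >= threshold_ms:
                pauses.append(duration)
    return pauses
```

It accepts any iterable of strings, so an open log file or the captured output of the container can be passed in directly. An empty result would be consistent with the "no JVM pauses logged" observation in the question.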

Answer 2

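I was able to solve the problem! (I had made a silly mistake.)

I connect to Ignite with the Python thin client, and instead of creating a new client for each query, I was reusing a single global client.

Here is what I think happened: because the client started the second query right after sending the first one for execution, the output of the first query was never retrieved. That blocked Ignite both from running the second query and from returning the first query's output.

If someone who knows more about Ignite could correct me or confirm whether this is right, I would appreciate it.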
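
For reference, the one-client-per-query approach can be sketched roughly as below. This is a minimal sketch, not my actual application code: it assumes pyignite's Client with its connect, sql and close methods, takes the host name ignite from the configuration in the question, and assumes the default thin-client port 10800. The helper names make_pyignite_client, ignite_client and run_query are made up for illustration, and credentials are left out.

```python
from contextlib import contextmanager


def make_pyignite_client():
    """Build a fresh pyignite thin client (imported lazily on first use)."""
    from pyignite import Client
    # Credentials are omitted here; with authenticationEnabled=true you
    # would pass your own username and password to Client.
    return Client()


@contextmanager
def ignite_client(host="ignite", port=10800, factory=make_pyignite_client):
    """Open a dedicated connection for one unit of work, then close it."""
    client = factory()
    client.connect(host, port)
    try:
        yield client
    finally:
        client.close()


def run_query(sql, **connect_kwargs):
    """Run one SQL statement on its own connection and return all rows."""
    with ignite_client(**connect_kwargs) as client:
        # Drain the whole result before the connection is closed, so no
        # half-read response is left behind on the socket.
        return list(client.sql(sql))
```

Each Flask request would then call run_query(...) instead of touching a module-level client, so two concurrent requests never read from the same socket. If opening a connection per query turns out to be too expensive, keeping one client per thread or guarding a shared client with a lock should serve the same purpose.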


python-3.x

ignite