Elasticsearch crashing

Question

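We have a problem with Elasticsearch crashing from time to time. Sometimes it also spikes RAM and CPU usage, and the server becomes unresponsive.

We left most of the settings at their defaults, but had to give the JVM heap more RAM (48GB) so that it would not crash as often.

I started digging, and apparently 32GB is the maximum you should use. We will adjust that.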
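As a rough illustration of that sizing rule, the sketch below picks a heap size as the smaller of half the machine's RAM and a cap just under 32GB. The 31GB cap and both function names are assumptions made for this example; the exact point at which a JVM stops using compressed object pointers varies, so treat the result as a starting point rather than a guaranteed optimum.

```python
def recommended_heap_gb(total_ram_gb, cap_gb=31):
    """Return a heap size in whole GB: at most half of RAM, never above cap_gb."""
    half_ram = int(total_ram_gb // 2)
    return max(1, min(half_ram, cap_gb))


def jvm_heap_lines(total_ram_gb):
    """Render matching -Xms/-Xmx lines, since min and max heap should be equal."""
    heap = recommended_heap_gb(total_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


if __name__ == "__main__":
    # 125 is the RAM size (in GB) of the server described in this question.
    print("\n".join(jvm_heap_lines(125)))
```

For the 125GB server in this question, the sketch yields a 31GB heap and leaves the remaining memory to the operating system's file cache.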

The server is:

CentOS 7
RAM: 125GB
CPU: 40 threads
Hard Drive: 2x Raid 1 NVME

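^^^ That is more than enough hardware for something like this, but something tells me more configuration is needed to handle this much data.

We are running a Magento 2.4.3 CE store with about 400,000 products.

These are all our config files:

jvm.options file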

    ## JVM configuration
    
    ################################################################
    ## IMPORTANT: JVM heap size
    ################################################################
    ##
    ## You should always set the min and max JVM heap
    ## size to the same value. For example, to set
    ## the heap to 4 GB, set:
    ##
    ## -Xms4g
    ## -Xmx4g
    ##
    ## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
    ## for more information
    ##
    ################################################################
    
    # Xms represents the initial size of total heap space
    # Xmx represents the maximum size of total heap space
    
    -Xms48g
    -Xmx48g
    
    ################################################################
    ## Expert settings
    ################################################################
    ##
    ## All settings below this section are considered
    ## expert settings. Don't tamper with them unless
    ## you understand what you are doing
    ##
    ################################################################
    
    ## GC configuration
    8-13:-XX:+UseConcMarkSweepGC
    8-13:-XX:CMSInitiatingOccupancyFraction=75
    8-13:-XX:+UseCMSInitiatingOccupancyOnly
    
    ## G1GC Configuration
    # NOTE: G1 GC is only supported on JDK version 10 or later
    # to use G1GC, uncomment the next two lines and update the version on the
    # following three lines to your version of the JDK
    # 10-13:-XX:-UseConcMarkSweepGC
    # 10-13:-XX:-UseCMSInitiatingOccupancyOnly
    14-:-XX:+UseG1GC
    14-:-XX:G1ReservePercent=25
    14-:-XX:InitiatingHeapOccupancyPercent=30
    
    ## DNS cache policy
    # cache ttl in seconds for positive DNS lookups noting that this overrides the
    # JDK security property networkaddress.cache.ttl; set to -1 to cache forever
    -Des.networkaddress.cache.ttl=60
    # cache ttl in seconds for negative DNS lookups noting that this overrides the
    # JDK security property networkaddress.cache.negative ttl; set to -1 to cache
    # forever
    -Des.networkaddress.cache.negative.ttl=10
    
    ## optimizations
    
    # pre-touch memory pages used by the JVM during initialization
    -XX:+AlwaysPreTouch
    
    ## basic
    
    # explicitly set the stack size
    -Xss1m
    
    # set to headless, just in case
    -Djava.awt.headless=true
    
    # ensure UTF-8 encoding by default (e.g. filenames)
    -Dfile.encoding=UTF-8
    
    # use our provided JNA always versus the system one
    -Djna.nosys=true
    
    # turn off a JDK optimization that throws away stack traces for common
    # exceptions because stack traces are important for debugging
    -XX:-OmitStackTraceInFastThrow
    
    # enable helpful NullPointerExceptions (https://openjdk.java.net/jeps/358), if
    # they are supported
    14-:-XX:+ShowCodeDetailsInExceptionMessages
    
    # flags to configure Netty
    -Dio.netty.noUnsafe=true
    -Dio.netty.noKeySetOptimization=true
    -Dio.netty.recycler.maxCapacityPerThread=0
    
    # log4j 2
    -Dlog4j.shutdownHookEnabled=false
    -Dlog4j2.disable.jmx=true
    
    -Djava.io.tmpdir=${ES_TMPDIR}
    
    ## heap dumps
    
    # generate a heap dump when an allocation from the Java heap fails
    # heap dumps are created in the working directory of the JVM
    -XX:+HeapDumpOnOutOfMemoryError
    
    # specify an alternative path for heap dumps; ensure the directory exists and
    # has sufficient space
    -XX:HeapDumpPath=/var/lib/elasticsearch
    
    # specify an alternative path for JVM fatal error logs
    -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
    
    ## JDK 8 GC logging
    
    8:-XX:+PrintGCDetails
    8:-XX:+PrintGCDateStamps
    8:-XX:+PrintTenuringDistribution
    8:-XX:+PrintGCApplicationStoppedTime
    8:-Xloggc:/var/log/elasticsearch/gc.log
    8:-XX:+UseGCLogFileRotation
    8:-XX:NumberOfGCLogFiles=32
    8:-XX:GCLogFileSize=64m
    
    # JDK 9+ GC logging
    9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
    # due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
    # time/date parsing will break in an incompatible way for some date patterns and locals
    9-:-Djava.locale.providers=COMPAT
    
    # temporary workaround for C2 bug with JDK 10 on hardware with AVX-512
    10-:-XX:UseAVX=2


**elasticsearch.yml**

    # ======================== Elasticsearch Configuration =========================
    #
    # NOTE: Elasticsearch comes with reasonable defaults for most settings.
    #       Before you set out to tweak and tune the configuration, make sure you
    #       understand what are you trying to accomplish and the consequences.
    #
    # The primary way of configuring a node is via this file. This template lists
    # the most important settings you may want to configure for a production cluster.
    #
    # Please consult the documentation for further information on configuration options:
    # https://www.elastic.co/guide/en/elasticsearch/reference/index.html
    #
    # ---------------------------------- Cluster -----------------------------------
    #
    # Use a descriptive name for your cluster:
    #
    #cluster.name: my-application
    #
    # ------------------------------------ Node ------------------------------------
    #
    # Use a descriptive name for the node:
    #
    #node.name: node-1
    #
    # Add custom attributes to the node:
    #
    #node.attr.rack: r1
    #
    # ----------------------------------- Paths ------------------------------------
    #
    # Path to directory where to store the data (separate multiple locations by comma):
    #
    path.data: /var/lib/elasticsearch
    #
    # Path to log files:
    #
    path.logs: /var/log/elasticsearch
    #
    # ----------------------------------- Memory -----------------------------------
    #
    # Lock the memory on startup:
    #
    #bootstrap.memory_lock: true
    #
    # Make sure that the heap size is set to about half the memory available
    # on the system and that the owner of the process is allowed to use this
    # limit.
    #
    # Elasticsearch performs poorly when the system is swapping the memory.
    #
    # ---------------------------------- Network -----------------------------------
    #
    # Set the bind address to a specific IP (IPv4 or IPv6):
    #
    #network.host: 192.168.0.1
    #
    # Set a custom port for HTTP:
    #
    #http.port: 9200
    #
    # For more information, consult the network module documentation.
    #
    # --------------------------------- Discovery ----------------------------------
    #
    # Pass an initial list of hosts to perform discovery when new node is started:
    # The default list of hosts is ["127.0.0.1", "[::1]"]
    #
    #discovery.zen.ping.unicast.hosts: ["host1", "host2"]
    #
    # Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
    #
    #discovery.zen.minimum_master_nodes:
    #
    # For more information, consult the zen discovery module documentation.
    #
    # ---------------------------------- Gateway -----------------------------------
    #
    # Block initial recovery after a full cluster restart until N nodes are started:
    #
    #gateway.recover_after_nodes: 3
    #
    # For more information, consult the gateway module documentation.
    #
    # ---------------------------------- Various -----------------------------------
    #
    # Require explicit names when deleting indices:
    #
    #action.destructive_requires_name: true
    xpack.security.enabled: true

From my research, the RAM + CPU spikes could be caused by these settings not being set:

    gateway.expected_nodes: 10
    gateway.recover_after_time: 5m

Here is some other data from Elasticsearch:

    curl -XGET --user username:password http://localhost:9200/

    {
      "name" : "web1.example.com",
      "cluster_name" : "elasticsearch",
      "cluster_uuid" : "S8fFQ993QDWkLY8lZtp_mQ",
      "version" : {
        "number" : "7.13.2",
        "build_flavor" : "default",
        "build_type" : "rpm",
        "build_hash" : "4d960a0733be83dd2543ca018aa4ddc42e956800",
        "build_date" : "2021-06-10T21:01:55.251515791Z",
        "build_snapshot" : false,
        "lucene_version" : "8.8.2",
        "minimum_wire_compatibility_version" : "6.8.0",
        "minimum_index_compatibility_version" : "6.0.0-beta1"
      },
      "tagline" : "You Know, for Search"
    }

    curl --user username:password -sS http://localhost:9200/_cluster/health?pretty

    {
      "cluster_name" : "elasticsearch",
      "status" : "yellow",
      "timed_out" : false,
      "number_of_nodes" : 1,
      "number_of_data_nodes" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 4,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 0,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 0,
      "active_shards_percent_as_number" : 55.55555555555556
    }

    curl --user username:password -sS http://localhost:9200/_cluster/allocation/explain?pretty

    {
      "index" : "example-amasty_product_1_v156",
      "shard" : 0,
      "primary" : false,
      "current_state" : "unassigned",
      "unassigned_info" : {
        "reason" : "INDEX_CREATED",
        "at" : "2021-09-14T16:52:28.854Z",
        "last_allocation_status" : "no_attempt"
      },
      "can_allocate" : "no",
      "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
      "node_allocation_decisions" : [
        {
          "node_id" : "2THEUTSaQdmOJAAhTTN71g",
          "node_name" : "web1.example.com",
          "transport_address" : "127.0.0.1:9300",
          "node_attributes" : {
            "ml.machine_memory" : "134622244864",
            "xpack.installed" : "true",
            "transform.node" : "true",
            "ml.max_open_jobs" : "512",
            "ml.max_jvm_size" : "51539607552"
          },
          "node_decision" : "no",
          "weight_ranking" : 1,
          "deciders" : [
            {
              "decider" : "same_shard",
              "decision" : "NO",
              "explanation" : "a copy of this shard is already allocated to this node"
            }
          ]
        }
      ]
    }

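^^^ The problem is that I don't know how to set up multiple nodes on one machine.

As I understand it, the misconfiguration is that we are running only one node. From what I have read, a green status requires 3 master nodes.

How do I set up multiple nodes on a single machine, and do I need to add more data nodes?

My main suspicions:

Not enough master/data nodes
The newer garbage collector is causing problems (G1GC is enabled; I was not sure how to tell from the config which collector is currently active). Figured it out: G1 is in use.
No recovery settings for after a crash (gateway.expected_nodes, gateway.recover_after_time)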
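On the question of telling from the config which collector is active: lines in jvm.options may carry a JDK version prefix (`8:` for exactly 8, `8-13:` for a range, `14-:` for 14 and later), and a prefixed line only applies when the running JDK falls in that range. The sketch below is a hypothetical helper, not part of Elasticsearch, that filters such lines for a given JDK major version; the sample lines are copied from the file above, and the JDK versions used are assumptions for illustration.

```python
import re

# Optional JDK version prefix: "8:", "8-13:" or "14-:", followed by a JVM flag.
_PREFIX = re.compile(r"^(\d+)(-(\d*))?:(-.+)$")


def active_flags(lines, jdk_major):
    """Return the JVM flags from jvm.options lines that apply to jdk_major."""
    flags = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        match = _PREFIX.match(line)
        if match is None:
            flags.append(line)  # no prefix: applies to every JDK
            continue
        low = int(match.group(1))
        if match.group(2) is None:
            high = low  # "8:" means exactly JDK 8
        elif match.group(3):
            high = int(match.group(3))  # "8-13:" is an inclusive range
        else:
            high = None  # "14-:" means 14 and later
        if jdk_major >= low and (high is None or jdk_major <= high):
            flags.append(match.group(4))
    return flags


SAMPLE = """
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
14-:-XX:+UseG1GC
-Xss1m
8:-XX:+PrintGCDetails
""".splitlines()

if __name__ == "__main__":
    for version in (11, 16):  # example JDK versions, not taken from the post
        print(version, active_flags(SAMPLE, version))
```

With this sample, a JDK 14 or later selects G1 and skips every `8-13:` line, so editing the CMS flags would have no effect on such a runtime. The running node should also report its collectors through the nodes info API (`GET _nodes/jvm`), which is the more direct check.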

Update:

Here is the error log from elasticsearch.log:

https://privfile.com/download.php?fid=6141256d5e639-MTAwNTM=

Sorry, the log file doesn't fit in a Stack Overflow post :)

Pastebin:

Part 1: https://pastebin.com/86sLM9BD
Part 2: https://pastebin.com/1VEn63TQ

Update:

Output of _cluster/stats?pretty&human:

https://pastebin.com/EM8ZMVst

Update:

Figured out how to limit the number of replicas.

This can be done with a template:

    PUT _template/all
    {
      "template": "*",
      "settings": {
        "number_of_replicas": 0
      }
    }
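One caveat worth sketching: an index template only affects indices created after it is stored, so indices that already exist keep their replica count until their settings are updated or they are recreated. The example below builds, without sending, both requests using Python's standard library. The `replica_requests` name and the `example-*` pattern (guessed from the index name in the allocation output) are assumptions; the template body uses `index_patterns`, which is the field name the 7.x documentation uses for this API.

```python
import json
import urllib.request


def replica_requests(base_url="http://localhost:9200", index_pattern="example-*"):
    """Build (but do not send) two PUT requests that set replicas to 0."""
    headers = {"Content-Type": "application/json"}
    # Legacy index template: only applies to indices created afterwards.
    template = urllib.request.Request(
        f"{base_url}/_template/all",
        data=json.dumps(
            {"index_patterns": ["*"], "settings": {"number_of_replicas": 0}}
        ).encode("utf-8"),
        headers=headers,
        method="PUT",
    )
    # Settings update: changes indices that already exist and match the pattern.
    existing = urllib.request.Request(
        f"{base_url}/{index_pattern}/_settings",
        data=json.dumps({"index": {"number_of_replicas": 0}}).encode("utf-8"),
        headers=headers,
        method="PUT",
    )
    return template, existing


if __name__ == "__main__":
    for request in replica_requests():
        print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

Because `xpack.security.enabled` is true on this node, each request would also need an `Authorization` header before being passed to `urllib.request.urlopen`.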

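I will test tomorrow whether it works and turns the status green.

I don't think it will improve performance at all, but we'll see.

I am working through the other suggestions:

Heap usage limited to 31GB
File descriptors already set to 65535
Max threads already set to 4096
Max size virtual memory check already increased and configured
Max map count increased to 262144
G1GC is disabled (default)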
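The numeric limits in this list can be checked mechanically. The sketch below compares a set of observed values against the minimums quoted above; the dictionary keys and function names are made up for this example. Note that `read_linux_limits` reports the limits of the Python process that runs it, which may differ from those of the Elasticsearch service, so it is only a rough stand-in.

```python
# Minimum values quoted in the list above.
REQUIRED = {
    "max_file_descriptors": 65535,
    "max_threads": 4096,
    "vm.max_map_count": 262144,
}


def failing_limits(actual, required=REQUIRED):
    """Return {name: (actual, required)} for each limit that is missing or too low."""
    failures = {}
    for name, minimum in required.items():
        value = actual.get(name)
        if value is None or value < minimum:
            failures[name] = (value, minimum)
    return failures


def read_linux_limits():
    """Read the limits of the current process on Linux (resource is Unix-only)."""
    import resource

    def soft_limit(kind):
        soft = resource.getrlimit(kind)[0]
        # RLIM_INFINITY means "unlimited", which satisfies any minimum.
        return float("inf") if soft == resource.RLIM_INFINITY else soft

    with open("/proc/sys/vm/max_map_count") as handle:
        max_map_count = int(handle.read().strip())
    return {
        "max_file_descriptors": soft_limit(resource.RLIMIT_NOFILE),
        "max_threads": soft_limit(resource.RLIMIT_NPROC),
        "vm.max_map_count": max_map_count,
    }


if __name__ == "__main__":
    print(failing_limits(read_linux_limits()))
```

An empty result means every listed limit meets its minimum for the process that ran the check.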

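One thing I am trying is lowering:

    8-13:-XX:CMSInitiatingOccupancyFraction=75

to

    8-13:-XX:CMSInitiatingOccupancyFraction=70

I believe this will make garbage collection start earlier and prevent the out-of-memory errors. We will try tuning this up/down to see whether it helps.

Switching to G1GC

I realize this is discouraged, but there are articles on handling similar out-of-memory problems where switching to G1GC helped resolve the issue: https://medium.com/naukri-engineering/garbage-collection-in-elasticsearch-and-the-g1gc-16b79a447181

This will be the last thing I try.

Update:

After all these changes, the index finally turned green (the template fix worked).

There were no issues overnight either. It is not as snappy as it was with 50GB of RAM, but at least it is stable.

General advice for future Elasticsearch troubleshooters: go through the bootstrap checks. That will at least get you to a performance baseline.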

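Update: found a problem where the JVM was picking up settings from two locations and using them for different purposes.

It looks like the sysadmin put a heap_size.options file into

/etc/elasticsearch/jvm.options.d

That file set the JVM heap to 31GB, but the main jvm.options file said 8GB. This affected the GC threads, which ran as if only 8GB of RAM were available (while the process still took up all 31GB).

I removed the file and set 31GB in the jvm.options file.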
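Since Elasticsearch reads both jvm.options and the `*.options` files in jvm.options.d, duplicate heap definitions like this are easy to miss. The following is a minimal sketch of a check for that situation; the function names are hypothetical, and `/etc/elasticsearch/jvm.options` as the location of the main file is an assumption based on the directory mentioned above.

```python
import re
from pathlib import Path

# -Xms/-Xmx flag, optionally preceded by a JDK version prefix such as "14-:".
HEAP_FLAG = re.compile(r"^(?:\d+(?:-\d*)?:)?-(Xms|Xmx)(\S+)$")


def heap_settings(files):
    """files maps file name -> file text. Returns {flag: {value: [file names]}}."""
    found = {"Xms": {}, "Xmx": {}}
    for name, text in files.items():
        for raw in text.splitlines():
            match = HEAP_FLAG.match(raw.strip())
            if match:
                value = match.group(2).lower()
                found[match.group(1)].setdefault(value, []).append(name)
    return found


def conflicts(found):
    """Keep only the flags that are defined with more than one distinct value."""
    return {flag: values for flag, values in found.items() if len(values) > 1}


def load_config_dir(config_dir="/etc/elasticsearch"):
    """Read jvm.options plus every *.options file under jvm.options.d."""
    root = Path(config_dir)
    paths = [root / "jvm.options", *sorted((root / "jvm.options.d").glob("*.options"))]
    return {str(path): path.read_text() for path in paths if path.is_file()}


if __name__ == "__main__":
    print(conflicts(heap_settings(load_config_dir())))
```

Run against the setup described above, this would report both `Xms` and `Xmx` as defined twice, once with 8g and once with 31g.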

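That stabilized things somewhat, but the GC was still collecting at a high rate.

As soon as I add any attribute to the list to be indexed, GC activity overflows memory again.

The only thing that rescues it is deleting the index and reindexing.

I am at the point of considering nuking the whole Elasticsearch installation and installing it myself.

That shouldn't be hard.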

Answer 1

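A few things:

High CPU or memory usage will not be caused by those gateway settings being unset, and on a single-node cluster they are largely irrelevant
We recommend keeping the heap <32GB, see https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html#set-jvm-heap-size
You can never allocate a replica shard on the same node as its primary. So for a single-node cluster you either need to remove the replicas (risky) or add another (ideally) 2 nodes to the cluster
Setting up a multi-node cluster on the same host is somewhat pointless. Sure, your replicas will be allocated, but if you lose the host you lose all of the data anyway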
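To make the replica rule concrete, here is a small model of it: every copy of a shard needs its own node, so copies beyond the node count stay unassigned and the cluster reports yellow. The index layout used below (four single-shard indices with one replica each, plus one without replicas) is only an assumption that happens to reproduce the numbers in the health output from the question; the function is an illustration, not how Elasticsearch computes health internally.

```python
def cluster_health(node_count, indices):
    """indices is a list of (primary_shards, replicas_per_primary) tuples."""
    if node_count < 1:
        raise ValueError("this sketch assumes at least one data node")
    active = unassigned = 0
    for primaries, replicas in indices:
        copies_wanted = 1 + replicas
        # A node can hold at most one copy of any given shard.
        copies_placed = min(copies_wanted, node_count)
        active += primaries * copies_placed
        unassigned += primaries * (copies_wanted - copies_placed)
    total = active + unassigned
    return {
        "status": "green" if unassigned == 0 else "yellow",
        "active_shards": active,
        "unassigned_shards": unassigned,
        "active_percent": 100.0 * active / total if total else 100.0,
    }


# Assumed layout that matches the reported 5 active and 4 unassigned shards.
ASSUMED_INDICES = [(1, 1)] * 4 + [(1, 0)]

if __name__ == "__main__":
    print(cluster_health(1, ASSUMED_INDICES))  # single node
    print(cluster_health(2, ASSUMED_INDICES))  # after adding a second node
```

In this model the status turns green either by adding a node or by setting the replica count to 0, which are the two options listed above.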

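I would suggest looking at https://www.elastic.co/guide/en/elasticsearch/reference/7.14/bootstrap-checks.html and applying the settings mentioned there, because even if you are running a single node, these are what we refer to as production-ready settings

Other than that, do you have Monitoring enabled? What do your Elasticsearch logs show? What about hot threads? Or slow logs?

(Also, it's Elasticsearch; the s is not camel-cased ;))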

Answer 2

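We have solved the problem. The cause was a bad installation.

Something was not working properly (I still don't know exactly what).

Both ES and Java were reinstalled. I matched ES to the specific version running in my dev environment.

You can see here that the GC is finally working properly.

We also got ES directly from the source. The previous installation came from some random repo.

I threw in all the attributes the company needs and it didn't even notice: stable and fast.

Thanks to everyone who helped me through these steps, because I would not have nuked the ES installation without knowing I had done everything possible to make it stable.

It also taught me a lesson in configuring ES :)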
