Elasticsearch crashing
We're having a problem with Elasticsearch crashing from time to time. It also occasionally spikes RAM + CPU, and the server becomes unresponsive.
We kept most of the default settings, but had to give the JVM heap more RAM (48GB) to stop it crashing so often.
I started digging, and apparently 32GB is the maximum you should use. We'll adjust that.
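For reference, on ES 7.x the documented way to override the heap is a dedicated file under jvm.options.d rather than editing jvm.options itself; a minimal sketch (the filename heap.options is arbitrary):
# /etc/elasticsearch/jvm.options.d/heap.options
# min and max must match; staying below ~32GB keeps compressed object pointers enabled
-Xms31g
-Xmx31g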
The server is:
CentOS 7
RAM: 125GB
CPU: 40 threads
Hard drive: 2x NVMe in RAID 1
^^^ That's plenty of hardware for something like this, but something tells me more configuration is needed to handle this much data.
We're running a Magento 2.4.3 CE store with roughly 400,000 products.
Here are all of our config files:
jvm.options file
## JVM configuration
################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms48g
-Xmx48g
################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################
## GC configuration
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly
## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
# 10-13:-XX:-UseConcMarkSweepGC
# 10-13:-XX:-UseCMSInitiatingOccupancyOnly
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30
## DNS cache policy
# cache ttl in seconds for positive DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.ttl; set to -1 to cache forever
-Des.networkaddress.cache.ttl=60
# cache ttl in seconds for negative DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.negative.ttl; set to -1 to cache
# forever
-Des.networkaddress.cache.negative.ttl=10
## optimizations
# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch
## basic
# explicitly set the stack size
-Xss1m
# set to headless, just in case
-Djava.awt.headless=true
# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8
# use our provided JNA always versus the system one
-Djna.nosys=true
# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow
# enable helpful NullPointerExceptions (https://openjdk.java.net/jeps/358), if
# they are supported
14-:-XX:+ShowCodeDetailsInExceptionMessages
# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Djava.io.tmpdir=${ES_TMPDIR}
## heap dumps
# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError
# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=/var/lib/elasticsearch
# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
## JDK 8 GC logging
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m
# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch needs to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locales
9-:-Djava.locale.providers=COMPAT
# temporary workaround for C2 bug with JDK 10 on hardware with AVX-512
10-:-XX:UseAVX=2
**elasticsearch.yml**
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
# Before you set out to tweak and tune the configuration, make sure you
# understand what you are trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
#cluster.name: my-application
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
#node.name: node-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /var/lib/elasticsearch
#
# Path to log files:
#
path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
#network.host: 192.168.0.1
#
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.zen.ping.unicast.hosts: ["host1", "host2"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
#discovery.zen.minimum_master_nodes:
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
xpack.security.enabled: true
I read that the RAM + CPU spikes might be caused by these settings not being set:
gateway.expected_nodes: 10
gateway.recover_after_time: 5m
Here is some other data from Elasticsearch:
curl -XGET --user username:password http://localhost:9200/
{
  "name" : "web1.example.com",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "S8fFQ993QDWkLY8lZtp_mQ",
  "version" : {
    "number" : "7.13.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "4d960a0733be83dd2543ca018aa4ddc42e956800",
    "build_date" : "2021-06-10T21:01:55.251515791Z",
    "build_snapshot" : false,
    "lucene_version" : "8.8.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
curl --user username:password -sS http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 55.55555555555556
}
curl --user username:password -sS http://localhost:9200/_cluster/allocation/explain?pretty
{
  "index" : "example-amasty_product_1_v156",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2021-09-14T16:52:28.854Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "2THEUTSaQdmOJAAhTTN71g",
      "node_name" : "web1.example.com",
      "transport_address" : "127.0.0.1:9300",
      "node_attributes" : {
        "ml.machine_memory" : "134622244864",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "512",
        "ml.max_jvm_size" : "51539607552"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node"
        }
      ]
    }
  ]
}
^^^ The problem is that I don't know how to set up multiple nodes on one machine.
As I understand it, the misconfiguration is that we're running only one node. From what I've read, green status requires 3 master nodes.
How do I set up multiple nodes on a single machine, and do I need to add data nodes?
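For what it's worth, a second node on the same host mainly needs its own node.name, data/log paths, and ports. A minimal sketch of a second elasticsearch.yml - the config directory, node name, and ports below are illustrative assumptions, not part of the original setup:
# hypothetical second config dir: /etc/elasticsearch-node2/elasticsearch.yml
cluster.name: elasticsearch
node.name: web1-node2
path.data: /var/lib/elasticsearch-node2
path.logs: /var/log/elasticsearch-node2
http.port: 9201
transport.port: 9301
discovery.seed_hosts: ["127.0.0.1:9300", "127.0.0.1:9301"]
Each node also needs its own heap out of the same physical RAM budget, so two nodes on one box does not mean twice the capacity.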
My main suspicions:
- Not enough master/data nodes
- A problem with the newer garbage collector (is G1GC enabled? I wasn't sure how to tell from the config which one is currently active) --- figured it out - G1 is in use.
- No recovery settings for when it crashes (gateway.expected_nodes, gateway.recover_after_time)
Update:
Here is the error log from elasticsearch.log:
https://privfile.com/download.php?fid=6141256d5e639-MTAwNTM=
Sorry, the log file doesn't fit in a Stack Overflow post :)
Pastebins:
Part 1: https://pastebin.com/86sLM9BD
Part 2: https://pastebin.com/1VEn63TQ
Update:
Output of _cluster/stats?pretty&human:
Update:
Figured out how to limit the number of replicas.
This can be done with a template:
PUT _template/all
{
  "index_patterns": ["*"],
  "settings": {
    "number_of_replicas": 0
  }
}
I'll test tomorrow whether it works and turns the status green.
I don't think it will improve performance at all, but we'll see.
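One caveat: a legacy index template like the one above only applies to indices created after it is installed. To drop the replicas on the indices that already exist (which is what actually clears the yellow status), the index settings API can be pointed at _all; a sketch, reusing the credentials from above:
curl -XPUT --user username:password -H 'Content-Type: application/json' \
  http://localhost:9200/_all/_settings -d '{"index": {"number_of_replicas": 0}}'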
I'm working through the other suggestions (OS-level equivalents sketched after this list):
- RAM usage capped at 31GB
- File descriptors already set to 65535
- Max number of threads already set to 4096
- Max size virtual memory check already increased and configured
- Max map count increased to 262144
- G1GC disabled (the default)
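For anyone checking the same boxes, these mostly map to standard Linux limits; a sketch of where they typically live on a CentOS 7 RPM install (the sysctl.d filename is arbitrary, and on systemd installs the service unit's LimitNOFILE/LimitNPROC may apply instead of limits.conf):
# /etc/security/limits.conf (or a systemd override for the elasticsearch unit)
elasticsearch  -  nofile  65535
elasticsearch  -  nproc   4096

# /etc/sysctl.d/99-elasticsearch.conf
vm.max_map_count = 262144

# verify from the shell the service runs under
ulimit -n; ulimit -u; sysctl vm.max_map_count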
One thing I'm trying is reducing:
8-13:-XX:CMSInitiatingOccupancyFraction=75
to
8-13:-XX:CMSInitiatingOccupancyFraction=70
I believe this makes the CMS collector kick in earlier, which should help prevent out-of-memory errors. We'll try tuning this up/down to see if it helps.
Switching to G1GC
I realize this isn't exactly encouraged, but there are articles on handling similar out-of-memory problems where switching to G1GC helped: https://medium.com/naukri-engineering/garbage-collection-in-elasticsearch-and-the-g1gc-16b79a447181
This will be the last thing I try.
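If it comes to that, the switch is just swapping GC flags in jvm.options. A sketch for a bundled JDK 10-13 where CMS is currently active (on JDK 14+ the shipped config already selects G1; the G1 tuning values below simply mirror the existing 14- lines and are not required):
## comment out the CMS lines:
# 8-13:-XX:+UseConcMarkSweepGC
# 8-13:-XX:CMSInitiatingOccupancyFraction=75
# 8-13:-XX:+UseCMSInitiatingOccupancyOnly
## and enable G1 instead:
10-13:-XX:-UseConcMarkSweepGC
10-13:-XX:-UseCMSInitiatingOccupancyOnly
10-13:-XX:+UseG1GC
10-13:-XX:G1ReservePercent=25
10-13:-XX:InitiatingHeapOccupancyPercent=30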
Update:
After all these changes, the index finally went green (the template fix worked).
It also ran overnight without any problems. It's not as snappy as it was with 50GB of RAM, but at least it's stable.
General advice for future Elasticsearch troubleshooters: pass the bootstrap checks - that at least puts you at a performance baseline.
Update: found an issue where the JVM was picking up settings from two locations and using them for different purposes.
It looks like a sysadmin had dropped a heap_size.options file into
/etc/elasticsearch/jvm.options.d
That set the JVM to 31GB, but the main jvm.options file said 8GB. This affected the GC threads, which ran as if there were only 8GB of RAM (while still grabbing all 31GB of RAM).
I deleted the file and set 31GB in the jvm.options file.
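To verify which heap actually took effect, the nodes info API reports what the JVM ended up with; a sketch:
curl -sS --user username:password \
  "http://localhost:9200/_nodes/jvm?pretty&filter_path=nodes.*.jvm.mem.heap_init_in_bytes,nodes.*.jvm.mem.heap_max_in_bytes"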
This stabilized things somewhat, but GC was still collecting at a furious pace.
As soon as I added any attributes to the list to be indexed, GC collections would blow through memory again.
The only thing that saved it was deleting the index and reindexing.
I'm at the point of considering tearing down the entire Elasticsearch installation and installing it myself.
That shouldn't be hard.
A few things:
- High CPU or memory use won't be caused by not setting those gateway settings, and on a single-node cluster they are somewhat irrelevant
- We recommend keeping the heap <32GB, see https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html#set-jvm-heap-size
- You can never allocate a replica shard on the same node as its primary. So for a single-node cluster, you either need to remove the replicas (risky), or add another (ideally) 2 nodes to the cluster
- Setting up a multi-node cluster on the same host is a little pointless. Sure, your replicas will be allocated, but if you lose the host you lose all the data anyway
I'd suggest looking at https://www.elastic.co/guide/en/elasticsearch/reference/7.14/bootstrap-checks.html and applying the settings mentioned there, because even though you're running a single node, these are what we'd call production-ready settings
Beyond that, do you have monitoring enabled? What do your Elasticsearch logs show? What about hot threads? Or the slow logs?
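For reference, both of those are a single call; a sketch (the index name and threshold are placeholders):
# hot threads across all nodes
curl -sS --user username:password http://localhost:9200/_nodes/hot_threads

# enable the search slow log on one index
curl -XPUT --user username:password -H 'Content-Type: application/json' \
  http://localhost:9200/example-index/_settings -d '{"index.search.slowlog.threshold.query.warn": "2s"}'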
(BTW, it's Elasticsearch, with a lowercase s, not camel case ;))
We've since resolved the problem. The issue was a bad installation.
Something wasn't working properly (still don't know what the exact problem was).
Both ES and Java were reinstalled. I matched ES to the specific version that runs in my dev environment.
You can see here that GC is finally working normally.
We also pulled ES directly from the source this time. The previous installation came from some random repo.
I threw every attribute the company needs at it and it didn't even blink - stable and fast.
Thanks to everyone who helped me work through these steps, because I wasn't going to tear down the ES installation without knowing I'd tried everything possible to stabilize it.
It also taught me a lesson about configuring ES :)