ScheduledExecutorService 在 1 分钟的计划中延迟 10 分钟(systemd - journald 故障)

ScheduledExecutorService slips for 10 minutes on a 1 minute schedule (systemd - journald fault)

我有一个执行程序服务,它应该每分钟向磁盘写入一些东西。

安排是这样的:

    scheduledCacheDump = new ScheduledThreadPoolExecutor(1);
    scheduledCacheDump.scheduleAtFixedRate(this::saveCachedRecords,
                                           60,
                                           60,
                                           TimeUnit.SECONDS
    );

该任务使用由主线程填充的共享列表,因此它在该列表上同步:

   private void saveCachedRecords() {
        LOG.info(String.format("Scheduled record dump to disk. We have %d records to save.", recordCache.size()));
        synchronized (recordCache) {
            Iterator<Record> iterator = recordCache.iterator();
            while (iterator.hasNext()) {
               // save record to disk
               iterator.remove();
            }
        }
    }

我的列表是这样声明的:

private final List<Record> recordCache = new ArrayList<>();

主线程是批量接收数据的,所以每隔一秒左右,它就会接收到缓存在列表中的30条记录。其余时间它在等待套接字。

我不明白的是,从日志中可以看出,我的计划任务经常在超过 1 分钟的时间内被触发:

sept. 16 09:30:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:31:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:32:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:33:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:34:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:35:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:42:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:43:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:44:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:45:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:46:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:55:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:56:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:57:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:58:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 09:59:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 10:04:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 10:05:43 Scheduled record dump to disk. We have 27 records to save. sept. 16 10:06:43 Scheduled record dump to disk. We have 27 records to save.

看看这个:

=> 5 分钟

甚至:

=> 9 分钟

我的日志在synchronized()范围内,所以我不知道任务是否真的按时调度并在锁上等待10分钟,或者它是否只是一个真正的调度问题。我将把它移出它,但通常我无法理解线程如何在大约每秒释放的锁上保持阻塞 10 分钟。

我该如何调查?

供参考:运行它的机器是 KVM 机器,这可能是一个因素吗?

天哪..

这根本不是 Java 的错。这是 systemd 的错。

我的进程作为 systemd 服务运行,所以我提取的日志来自 systemd-journald。 你猜怎么着,systemd 中有速率限制。当调度程序被触发时,我的守护进程会命中它,所以我得到了很多这样的行:

Suppressed 570 messages from /system.slice/xxx.service Suppressed 769 messages from /system.slice/xxx.service Suppressed 745 messages from /system.slice/xxx.service Suppressed 729 messages from /system.slice/xxx.service Suppressed 717 messages from /system.slice/xxx.service Suppressed 95 messages from /system.slice/xxx.service Suppressed 543 messages from /system.slice/xxx.service

所以..是的,我取消了对 journald 的限制,现在每分钟都有我的痕迹。

解决方案是:

  • 编辑/etc/systemd/journald.conf
  • 在末尾添加行:RateLimitInterval=0
  • 执行systemctl restart systemd-journald
  • 执行systemctl restart myservice

现在一切顺利

我会更新标题以供将来参考:)