Why does Hangfire wait for 15s every few seconds when polling SQL Server for jobs?

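I have inherited a system that uses Hangfire with SQL Server job storage. Usually, when a job is scheduled to run immediately, we notice that it takes a few seconds before it is triggered.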

Looking at SQL Profiler while running in my dev environment, the SQL run against the Hangfire database looks like this:

exec sp_executesql N'delete top (1) JQ
output DELETED.Id, DELETED.JobId, DELETED.Queue
from [HangFire].JobQueue JQ with (readpast, updlock, rowlock, forceseek)
where Queue in (@queues1) and (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))',N'@queues1 nvarchar(4000),@timeout float',@queues1=N'MYQUEUENAME_master',@timeout=-1800

-- Exactly the same SQL as above is executed about 6 times/second for about 3-4 seconds,
-- then nothing for about 2 seconds, then: 

exec sp_getapplock @Resource=N'HangFire:recurring-jobs:lock',@DbPrincipal=N'public',@LockMode=N'Exclusive',@LockOwner=N'Session',@LockTimeout=5000
exec sp_getapplock @Resource=N'HangFire:locks:schedulepoller',@DbPrincipal=N'public',@LockMode=N'Exclusive',@LockOwner=N'Session',@LockTimeout=5000
exec sp_executesql N'select top (@count) Value from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key and Score between @from and @to order by Score',N'@count int,@key nvarchar(4000),@from float,@to float',@count=1000,@key=N'recurring-jobs',@from=0,@to=1596053348
exec sp_executesql N'select top (@count) Value from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key and Score between @from and @to order by Score',N'@count int,@key nvarchar(4000),@from float,@to float',@count=1000,@key=N'schedule',@from=0,@to=1596053348
exec sp_releaseapplock @Resource=N'HangFire:recurring-jobs:lock',@LockOwner=N'Session'
exec sp_releaseapplock @Resource=N'HangFire:locks:schedulepoller',@LockOwner=N'Session'

-- Then nothing is executed for about 8-10 seconds, then: 

exec sp_executesql N'update [HangFire].Server set LastHeartbeat = @now where Id = @id',N'@now datetime,@id nvarchar(4000)',@now='2020-07-29 20:09:19.097',@id=N'ps12345:19764:fe362d1a-5ee4-4d97-b70d-134fdfab2b87'

-- Then about 500ms-2s later I get 
exec sp_executesql N'delete top (1) JQ ... -- i.e. Same as first query
The update LastHeartbeat query is only there every second time (from just a brief inspection, maybe that’s not exactly right).

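There seem to be at least 3 threads running the DELETE query against JQ, because I can see several RPC:Starting events before the corresponding RPC:Completed, which suggests they are executing in parallel rather than sequentially. I don't know whether that is normal, but it seems odd, as I thought we had just one "consumer" of the jobs.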

I only have one queue in my dev environment, though I guess we would have 20-50 in live.

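Any suggestions on where I should look for the configuration that is causing: a) the 8-10 second pause between checks for jobs, and b) the number of threads checking for jobs (it seems like I have too many)?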


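After writing this I realised we were using an old version, so I upgraded from 1.5.x to 1.7.12, upgraded the database, and changed the startup configuration to this: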

        app.UseHangfireDashboard();

        GlobalConfiguration.Configuration
            .UseSqlServerStorage(connstring, new SqlServerStorageOptions
            {
                CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
                QueuePollInterval = TimeSpan.Zero,
                SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
                UseRecommendedIsolationLevel = true,
                PrepareSchemaIfNecessary = true, // Default value: true
                EnableHeavyMigrations = true     // Default value: false
            })
            .UseAutofacActivator(_container);
        JobActivator.Current = new AutofacJobActivator(_container);

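But if anything, the problem is now worse. Or the same, only faster: there are now 20 calls to delete top (1) JQ... within about 1 second, then the other queries, then a 15-second wait, and then it all starts again.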

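To be clear, the main problem is that if a job is added during the 15-second delay, it takes the remainder of those 15 seconds before my job is executed. The second problem, I think, is that it hits SQL Server more than it needs to: 20 times per second is a bit much, at least for my needs.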

(Cross-posted to the Hangfire forums)

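I would suggest checking your Hangfire BackgroundJobServerOptions to see what polling interval you have set there. This defines how long the Hangfire server waits before checking whether there are any jobs in the queue to execute.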

From the Hangfire docs:

Hangfire Server periodically checks the schedule to enqueue scheduled jobs to their queues, allowing workers to execute them. By default, check interval is equal to 15 seconds, but you can change it by setting the SchedulePollingInterval property on the options you pass to the BackgroundJobServer constructor:

var options = new BackgroundJobServerOptions
{
    SchedulePollingInterval = TimeSpan.FromMinutes(1)
};
var server = new BackgroundJobServer(options);

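If you don't set QueuePollInterval, Hangfire with SQL Server storage defaults to polling every 15 seconds. So the first thing to do if you have this problem is to set QueuePollInterval to something smaller, e.g. 1 s.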
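As a minimal sketch of that setting, assuming storage is configured through GlobalConfiguration as in the question: `HangfireStorageConfig` is a hypothetical helper name, and `connstring` stands in for your own connection string.

```csharp
using System;
using Hangfire;
using Hangfire.SqlServer;

// Hypothetical helper class, named here for illustration only.
public static class HangfireStorageConfig
{
    // connstring is a placeholder: pass your own connection string.
    public static void Configure(string connstring)
    {
        GlobalConfiguration.Configuration
            .UseSqlServerStorage(connstring, new SqlServerStorageOptions
            {
                // Poll the job queue every second instead of relying on the default.
                QueuePollInterval = TimeSpan.FromSeconds(1)
            });
    }
}
```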

But in my case, setting it had no effect. The reason was that I was calling app.UseHangfireServer() before calling GlobalConfiguration.Configuration.UseSqlServerStorage() with the SqlServerStorageOptions.

When you call app.UseHangfireServer(), it uses the current value of JobStorage.Current. My code had set:

    var storage = new SqlServerStorage(connstring);
    JobStorage.Current = storage;

then later called

    app.UseHangfireServer();

and then later called

    GlobalConfiguration.Configuration
        .UseSqlServerStorage(connstring, new SqlServerStorageOptions
        {
            CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
            QueuePollInterval = TimeSpan.Zero,
            SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
            UseRecommendedIsolationLevel = true,
            PrepareSchemaIfNecessary = true,
            EnableHeavyMigrations = true
        });

Reordering the code so that UseSqlServerStorage() with the SqlServerStorageOptions is called before app.UseHangfireServer() means the SqlServerStorageOptions take effect.
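A sketch of the corrected ordering might look like the following. It assumes an OWIN-style `Startup` class, which the original code does not show; the `ConnString` constant is a placeholder, and the Autofac activator setup from the question is left out for brevity.

```csharp
using System;
using Hangfire;
using Hangfire.SqlServer;
using Owin;

public class Startup
{
    // Placeholder: substitute your own connection string.
    private const string ConnString = "<your connection string>";

    public void Configuration(IAppBuilder app)
    {
        // 1. Configure the storage, including its options, first.
        GlobalConfiguration.Configuration
            .UseSqlServerStorage(ConnString, new SqlServerStorageOptions
            {
                CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
                QueuePollInterval = TimeSpan.Zero,
                SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
                UseRecommendedIsolationLevel = true,
                PrepareSchemaIfNecessary = true,
                EnableHeavyMigrations = true
            });

        // 2. Start the server only after the storage is configured, so that
        //    it picks up the storage created with the options above.
        app.UseHangfireServer();
        app.UseHangfireDashboard();
    }
}
```

The point of the ordering is that nothing assigns JobStorage.Current with a bare `new SqlServerStorage(connstring)` before the server starts, so no server instance is created against a storage that uses default options.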