Intermittent SQL connection error

I have an ASP.NET (Sitecore) application, and the logs show intermittent SQL connection errors in our client's production environment. The exception is as follows:

Exception: System.Data.SqlClient.SqlException
Message: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)
Source: .Net SqlClient Data Provider
   at System.Data.SqlClient.SqlInternalConnectionTds..ctor(DbConnectionPoolIdentity identity, SqlConnectionString connectionOptions, SqlCredential credential, Object providerInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString userConnectionOptions, SessionData reconnectSessionData, DbConnectionPool pool, String accessToken, Boolean applyTransientFaultHandling)
   at System.Data.SqlClient.SqlConnectionFactory.CreateConnection(DbConnectionOptions options, DbConnectionPoolKey poolKey, Object poolGroupProviderInfo, DbConnectionPool pool, DbConnection owningConnection, DbConnectionOptions userOptions)
   at System.Data.ProviderBase.DbConnectionFactory.CreatePooledConnection(DbConnectionPool pool, DbConnection owningObject, DbConnectionOptions options, DbConnectionPoolKey poolKey, DbConnectionOptions userOptions)
   at System.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at System.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at Sitecore.Data.DataProviders.Sql.DataProviderCommand..ctor(IDbCommand command, DataProviderTransaction transaction, Boolean openConnection)
   at Sitecore.Data.DataProviders.Sql.SqlDataApi.<>c__DisplayClass4.<CreateCommand>b__3()
   at Sitecore.Data.DataProviders.NullRetryer.Execute[T](Func`1 action, Action recover)
   at Sitecore.Data.DataProviders.Sql.SqlDataApi.<>c__DisplayClass12.<CreateReader>b__10()
   at Sitecore.Data.DataProviders.NullRetryer.Execute[T](Func`1 action, Action recover)
   at Sitecore.Data.DataProviders.Sql.SqlDataApi.CreateReader(String sql, Object[] parameters)
   at Sitecore.Data.DataProviders.Sql.SqlDataApi.<CreateObjectReader>d__6`1.MoveNext()
   at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
   at Sitecore.Eventing.EventQueue.ProcessEvents(Action`2 handler)
   at Sitecore.Eventing.EventProvider.RaiseQueuedEvents()

Nested Exception

Exception: System.ComponentModel.Win32Exception
Message: The network path was not found


6420 16:53:53 ERROR Exception processing remote events from database: web
Exception: System.Data.SqlClient.SqlException
Message: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)
Source: .Net SqlClient Data Provider
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at Sitecore.Data.DataProviders.Sql.DataProviderCommand..ctor(IDbCommand command, DataProviderTransaction transaction, Boolean openConnection)
   at Sitecore.Data.DataProviders.Sql.SqlDataApi.<>c__DisplayClass4.<CreateCommand>b__3()
   at Sitecore.Data.DataProviders.NullRetryer.Execute[T](Func`1 action, Action recover)
   at Sitecore.Data.DataProviders.Sql.SqlDataApi.<>c__DisplayClass12.<CreateReader>b__10()
   at Sitecore.Data.DataProviders.NullRetryer.Execute[T](Func`1 action, Action recover)
   at Sitecore.Data.DataProviders.Sql.SqlDataApi.CreateReader(String sql, Object[] parameters)
   at Sitecore.Data.DataProviders.Sql.SqlDataApi.<CreateObjectReader>d__6`1.MoveNext()
   at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
   at Sitecore.Eventing.EventQueue.ProcessEvents(Action`2 handler)
   at Sitecore.Eventing.EventProvider.RaiseQueuedEvents()

Nested Exception

Exception: System.ComponentModel.Win32Exception
Message: The network path was not found

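This error is typical if you supply an invalid connection string, where the server you specify does not exist or is not reachable on the network. However, the site works fine 99% of the time.

In this example the error comes from Sitecore's RaiseQueuedEvents scheduled task, but the exception is also thrown elsewhere, including when a URL on the site is requested, which results in an HTTP 500.

Interestingly, they come in waves: there can be as many as 100 of these exceptions in the space of a few seconds.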
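To line these waves up against the infrastructure team's monitoring, it can help to count ERROR entries per time window. The following is only a rough sketch in Python. It assumes every log entry starts with a thread identifier, an HH:MM:SS time and a level, as in the `6420 16:53:53 ERROR ...` line above. The function names and the 10-second window are my own choices.

```python
import re
from collections import Counter

# Assumed log line shape: "<thread id> <HH:MM:SS> ERROR <message>".
# Stack trace lines and "Exception:" lines do not match and are skipped.
LINE_RE = re.compile(r"^.+?\s(\d{2}):(\d{2}):(\d{2})\s+ERROR\b")


def count_errors_per_window(lines, window_seconds=10):
    """Return {window_start_as_second_of_day: number_of_ERROR_lines}."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        hours, minutes, seconds = (int(group) for group in match.groups())
        second_of_day = hours * 3600 + minutes * 60 + seconds
        # Round down to the start of the window this entry falls into.
        counts[second_of_day - second_of_day % window_seconds] += 1
    return dict(counts)


def format_window(second_of_day):
    """Turn a second-of-day value back into HH:MM:SS for printing."""
    return "%02d:%02d:%02d" % (
        second_of_day // 3600,
        second_of_day % 3600 // 60,
        second_of_day % 60,
    )
```

Feeding a day's log file through `count_errors_per_window` and printing the windows with the highest counts gives timestamps that can be compared directly with the database-side figures.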

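The infrastructure team that manages our client's servers is adamant that this is not a network issue and that the application code is at fault. They report an increase in database traffic at the times these exceptions appear to occur:

(All these are CLEAR SPIKES compared to usual performance)
- at 10:10:14 – number of user connections increased from 60 to 90
- at 10:10:14 – number of “batch requests / s” increased from around 60 to 650
- at 10:10:32 – “disk avg. READ time” increased from 1ms to 8.4ms
- at 10:10:32 – network utilisation spiked from 0.3% to 18%

The SQL monitor recorded no network outage, and there was no impact on the server's CPU utilisation.

I am no expert in networking or SQL performance, but to me these statistics do not look unreasonable, nor like something that would cause subsequent connection attempts to fail with a 'network path not found' exception. If the server were busy, I would expect a timeout exception instead?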

I contacted Sitecore support, who were quick to say it is a network issue:

Based on these exceptions it doesn't seem they are Sitecore related. Messages clearly state that you have some kind of network error, so it will be appropriate to investigate further along with your Infrastructure team. I reviewed similar issues in our database and can highlight the following areas.
- Remote connection was forcibly closed/disabled
- Server was offline
- Something related to wrong security context. Firewall and antiviruses may affect that.

At the moment we are at a stalemate: my feeling is that the error message suggests it must be a network issue, but their team believes the site is somehow broken.

How can I diagnose where the problem lies? Could the code or Sitecore be at fault, or is it the network?

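Update: network details

The database server is hosted on a different network, connected via a VLAN I believe. The servers are load balanced, which *I think* is done with the firewall rather than a proper load balancer.

Update 2

The problem was that SQL Server was configured to allow both TCP and Named Pipes. Occasionally the client would try to connect via the latter, which does not use the standard SQL port. The solution was to prefix the data source/server in the connection string with tcp:, i.e. Data Source=tcp:xxx.xxx.xx.xxx, so that the connection always goes over TCP.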
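For anyone with several connection strings to update, the change can be sketched as a small Python helper. This is illustrative only: `force_tcp` is a hypothetical name. It assumes a simple `key=value;key=value` string with no quoted values containing semicolons, and it treats `Data Source`, `Server`, `Address`, `Addr` and `Network Address` as the server keywords.

```python
# Keywords that name the server in a SqlClient connection string (lowercased).
SERVER_KEYS = {"data source", "server", "address", "addr", "network address"}

# Values that already pin a protocol are left untouched.
PROTOCOL_PREFIXES = ("tcp:", "np:", "lpc:")


def force_tcp(connection_string):
    """Return the connection string with a tcp: prefix on its server value.

    Assumption: values are not quoted and contain no semicolons.
    """
    parts = []
    for part in connection_string.split(";"):
        key, separator, value = part.partition("=")
        if separator and key.strip().lower() in SERVER_KEYS:
            value = value.strip()
            if not value.lower().startswith(PROTOCOL_PREFIXES):
                value = "tcp:" + value
            part = key + "=" + value
        parts.append(part)
    return ";".join(parts)
```

For example, `force_tcp("Data Source=dbhost;Initial Catalog=Web")` returns `Data Source=tcp:dbhost;Initial Catalog=Web`, where `dbhost` and `Web` are placeholder names. A value that already starts with `np:` is left unchanged rather than rewritten, so an explicit Named Pipes setting still has to be reviewed by hand.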

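This is not related to Sitecore, but I have seen something very similar before with another content management system. I faced a similar challenge: the infrastructure people were convinced there was nothing wrong with the database server and that the problem lay with the website.

I suspected a network issue, so my approach was:

  1. I wrote a SQL script that could be run from the command line and would exhibit the same connection problem.

  2. I ran the script on the web server and I ran it on the database server itself. I recorded the results and compared them.

My tests showed that the error could not be reproduced at all when the script ran on the database server, but it did occur when run from the command line on the web server. That was the evidence supporting my hunch that the problem was connectivity related, and had nothing to do with the website or the database server.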
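I no longer have that script, but the same idea can be sketched without any SQL at all. The Python below is not what I ran. It is a plain TCP connect probe, which only tests whether the port is reachable and says nothing about logins or queries. The function names, the retry counts and the timeouts are all assumptions to adjust.

```python
import socket
import time


def probe_once(host, port, timeout=3.0):
    """Try one TCP connect; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # Covers refused connections, timeouts and name resolution failures.
        return False, time.monotonic() - start, str(exc)


def probe_loop(host, port, attempts, interval=1.0, timeout=3.0):
    """Probe repeatedly, print one timestamped line per attempt, return results."""
    results = []
    for _ in range(attempts):
        ok, elapsed, error = probe_once(host, port, timeout)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print("%s %s %.3fs %s" % (stamp, "OK" if ok else "FAIL", elapsed, error))
        results.append(ok)
        time.sleep(interval)
    return results
```

Running something like `probe_loop("dbhost", 1433, attempts=3600)` on the web server and on the database server at the same time, with `dbhost` standing in for the real server name and 1433 assumed to be the SQL port, gives two logs whose FAIL lines can be compared against the application's exception timestamps.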

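That focused attention on the configuration of the firewall separating the website's DMZ from the internal network where the SQL Server lived. The firewall was a load-balanced pair, and we were eventually able to find a subtle configuration difference that caused the second box to terminate connections initiated through the first box.

That specific problem seems unlikely to be the cause of yours, but perhaps the general approach of coming up with a test that helps locate the cause of the problem will be useful?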