GKE pods have connection issues under a high load of requests
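I have a private GKE v1.17.400 cluster running behind a NAT gateway. On the cluster, I have multiple applications that use Google services such as Stackdriver, Pub/Sub, and Cloud SQL.
My applications run on .NET Core 2.2. They subscribe and publish to Pub/Sub topics.
When the load is high, I experience connection issues to Google Cloud services.
The issue shows up in many different logs, for example:
Connection timeouts to Cloud SQL: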
MySql.Data.MySqlClient.MySqlException (0x80004005): Connect Timeout expired. ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
at MySqlConnector.Core.ServerSession.ConnectAsync(ConnectionSettings cs, ILoadBalancer loadBalancer, IOBehavior ioBehavior, CancellationToken cancellationToken) in C:\projects\mysqlconnector\src\MySqlConnector\Core\ServerSession.cs:line 360
at MySqlConnector.Core.ConnectionPool.GetSessionAsync(MySqlConnection connection, IOBehavior ioBehavior, CancellationToken cancellationToken) in C:\projects\mysqlconnector\src\MySqlConnector\Core\ConnectionPool.cs:line 112
at MySqlConnector.Core.ConnectionPool.GetSessionAsync(MySqlConnection connection, IOBehavior ioBehavior, CancellationToken cancellationToken) in C:\projects\mysqlconnector\src\MySqlConnector\Core\ConnectionPool.cs:line 141
at MySql.Data.MySqlClient.MySqlConnection.CreateSessionAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in C:\projects\mysqlconnector\src\MySqlConnector\MySql.Data.MySqlClient\MySqlConnection.cs:line 507
at MySql.Data.MySqlClient.MySqlConnection.CreateSessionAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in C:\projects\mysqlconnector\src\MySqlConnector\MySql.Data.MySqlClient\MySqlConnection.cs:line 523
at MySql.Data.MySqlClient.MySqlConnection.OpenAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in C:\projects\mysqlconnector\src\MySqlConnector\MySql.Data.MySqlClient\MySqlConnection.cs:line 232
at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenDbConnectionAsync(Boolean errorsExpected, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenAsync(CancellationToken cancellationToken, Boolean errorsExpected)
at Pomelo.EntityFrameworkCore.MySql.Storage.Internal.MySqlRelationalConnection.BeginTransactionAsync(IsolationLevel isolationLevel, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.BeginTransactionAsync(CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(DbContext _, ValueTuple`2 parameters, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(IReadOnlyList`1 entriesToSave, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.DbContext.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)
Health checks for the pods fail, and sometimes the pods restart:
fail: Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService[103]
Health check Users-Database completed after 18010.9435ms with status Unhealthy and 'FAILED to access users table.'
Problems with Google Cloud Storage and Google Stackdriver:
Unable to log to provider GoogleStackdriverLogProvider, ex: Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Getting metadata from plugin failed with error: Exception occurred in metadata credentials plugin. System.Net.Http.HttpRequestException: The SSL connection could not be established, see inner exception. ---> System.IO.IOException: Authentication failed because the remote party has closed the transport stream.
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.PartialFrameCallback(AsyncProtocolRequest asyncRequest)
--- End of stack trace from previous location where exception was thrown ---
at System.Net.Security.SslState.ThrowIfExceptional()
at System.Net.Security.SslState.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)
at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)
at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)
at System.Net.Security.SslStream.<>c.<AuthenticateAsClientAsync>b__47_1(IAsyncResult iar)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at Google.Apis.Http.ConfigurableMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
at Google.Apis.Auth.OAuth2.Requests.TokenRequestExtenstions.ExecuteAsync(TokenRequest request, HttpClient httpClient, String tokenServerUrl, CancellationToken taskCancellationToken, IClock clock, ILogger logger)
at Google.Apis.Auth.OAuth2.ServiceAccountCredential.RequestAccessTokenAsync(CancellationToken taskCancellationToken)
at Google.Apis.Auth.OAuth2.TokenRefreshManager.RefreshTokenAsync()
at Google.Apis.Auth.OAuth2.TokenRefreshManager.GetAccessTokenForRequestAsync(CancellationToken cancellationToken)
at Google.Apis.Auth.OAuth2.ServiceAccountCredential.GetAccessTokenForRequestAsync(String authUri, CancellationToken cancellationToken)
at Google.Apis.Auth.OAuth2.ServiceCredential.GetAccessTokenWithHeadersForRequestAsync(String authUri, CancellationToken cancellationToken)
at Grpc.Auth.GoogleAuthInterceptors.<>c__DisplayClass3_0.<<FromCredential>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at Grpc.Core.Internal.NativeMetadataCredentialsPlugin.GetMetadataAsync(AuthInterceptorContext context, IntPtr callbackPtr, IntPtr userDataPtr)", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1611499756.899018095","description":"Getting metadata from plugin failed with error: Exception occurred in metadata credentials plugin. System.Net.Http.HttpRequestException: The SSL connection could not be established, see inner exception. ---> System.IO.IOException: Authentication failed because the remote party has closed the transport stream.\n at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n at 
System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.PartialFrameCallback(AsyncProtocolRequest asyncRequest)\n--- End of stack trace from previous location where exception was thrown ---\n at System.Net.Security.SslState.ThrowIfExceptional()\n at System.Net.Security.SslState.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)\n at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)\n at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)\n at System.Net.Security.SslStream.<>c.<AuthenticateAsClientAsync>b__47_1(IAsyncResult iar)\n at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)\n--- End of stack trace from previous location where exception was thrown ---\n at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, 
CancellationToken cancellationToken)\n --- End of inner exception stack trace ---\n at Google.Apis.Http.ConfigurableMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)\n at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)\n at Google.Apis.Auth.OAuth2.Requests.TokenRequestExtenstions.ExecuteAsync(TokenRequest request, HttpClient httpClient, String tokenServerUrl, CancellationToken taskCancellationToken, IClock clock, ILogger logger)\n at Google.Apis.Auth.OAuth2.ServiceAccountCredential.RequestAccessTokenAsync(CancellationToken taskCancellationToken)\n at Google.Apis.Auth.OAuth2.TokenRefreshManager.RefreshTokenAsync()\n at Google.Apis.Auth.OAuth2.TokenRefreshManager.GetAccessTokenForRequestAsync(CancellationToken cancellationToken)\n at Google.Apis.Auth.OAuth2.ServiceAccountCredential.GetAccessTokenForRequestAsync(String authUri, CancellationToken cancellationToken)\n at Google.Apis.Auth.OAuth2.ServiceCredential.GetAccessTokenWithHeadersForRequestAsync(String authUri, CancellationToken cancellationToken)\n at Grpc.Auth.GoogleAuthInterceptors.<>c__DisplayClass3_0.<<FromCredential>b__0>d.MoveNext()\n--- End of stack trace from previous location where exception was thrown ---\n at Grpc.Core.Internal.NativeMetadataCredentialsPlugin.GetMetadataAsync(AuthInterceptorContext context, IntPtr callbackPtr, IntPtr userDataPtr)","file":"/var/local/git/grpc/src/core/lib/security/credentials/plugin/plugin_credentials.cc","file_line":93,"grpc_status":14}")
at Google.Api.Gax.Grpc.ApiCallRetryExtensions.<>c__DisplayClass0_0`2.<<WithRetry>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at MyCode.Utils.LogService.Providers.GoogleStackdriverLogProvider.WriteAsync(IEnumerable`1 entries)
at MyCode.Utils.LogService.Logger.WritePendingEntries()
875 Information Error while calling `UploadGZipObjectAsync`. Retry (00:00:08) taking place. Exception=System.Net.Http.HttpRequestException: The SSL connection could not be established, see inner exception. ---> System.IO.IOException: Authentication failed because the remote party has closed the transport stream.
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.PartialFrameCallback(AsyncProtocolRequest asyncRequest)
--- End of stack trace from previous location where exception was thrown ---
at System.Net.Security.SslState.ThrowIfExceptional()
at System.Net.Security.SslState.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)
at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)
at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)
at System.Net.Security.SslStream.<>c.<AuthenticateAsClientAsync>b__47_1(IAsyncResult iar)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at Google.Cloud.Storage.V1.StorageClientImpl.UploadHelper.CheckFinalProgress()
at Google.Cloud.Storage.V1.StorageClientImpl.UploadHelper.ExecuteAsync(CancellationToken cancellationToken)
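There appear to be multiple connection issues that are related to each other and transient, and they occur when the load is high.
My investigation
I looked at dmesg on the pods that had connection issues and got this output: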
[Tue Jan 26 15:44:30 2021] systemd-journald[120]: Received request to flush runtime journal from PID 1
[Tue Jan 26 15:44:31 2021] EXT4-fs (sda1): resizing filesystem from 1533435 to 25126395 blocks
[Tue Jan 26 15:44:33 2021] EXT4-fs (sda1): resized filesystem to 25126395
[Tue Jan 26 15:44:40 2021] Bridge firewalling registered
[Tue Jan 26 15:44:40 2021] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[Tue Jan 26 15:44:43 2021] EXT4-fs (sda1): re-mounted. Opts: commit=30
[Tue Jan 26 15:44:43 2021] EXT4-fs (sda1): re-mounted. Opts: commit=30
[Tue Jan 26 15:44:46 2021] EXT4-fs (sda1): re-mounted. Opts: commit=30
[Tue Jan 26 15:44:49 2021] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
[Tue Jan 26 15:44:50 2021] IPVS: Registered protocols (TCP, UDP, SCTP, AH, ESP)
[Tue Jan 26 15:44:50 2021] IPVS: Connection hash table configured (size=4096, memory=64Kbytes)
[Tue Jan 26 15:44:50 2021] IPVS: ipvs loaded.
[Tue Jan 26 15:44:50 2021] IPVS: [rr] scheduler registered.
[Tue Jan 26 15:44:50 2021] IPVS: [wrr] scheduler registered.
[Tue Jan 26 15:44:50 2021] IPVS: [sh] scheduler registered.
[Tue Jan 26 15:44:53 2021] IPv6: ADDRCONF(NETDEV_UP): veth1e78fb72: link is not ready
[Tue Jan 26 15:44:53 2021] IPv6: ADDRCONF(NETDEV_CHANGE): veth1e78fb72: link becomes ready
[Tue Jan 26 15:44:53 2021] cbr0: port 1(veth1e78fb72) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 1(veth1e78fb72) entered disabled state
[Tue Jan 26 15:44:53 2021] device veth1e78fb72 entered promiscuous mode
[Tue Jan 26 15:44:53 2021] cbr0: port 1(veth1e78fb72) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 1(veth1e78fb72) entered forwarding state
[Tue Jan 26 15:44:53 2021] device cbr0 entered promiscuous mode
[Tue Jan 26 15:44:53 2021] cbr0: port 2(veth51bc9563) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 2(veth51bc9563) entered disabled state
[Tue Jan 26 15:44:53 2021] device veth51bc9563 entered promiscuous mode
[Tue Jan 26 15:44:53 2021] cbr0: port 2(veth51bc9563) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 2(veth51bc9563) entered forwarding state
[Tue Jan 26 15:44:53 2021] cbr0: port 3(veth902031c6) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 3(veth902031c6) entered disabled state
[Tue Jan 26 15:44:53 2021] device veth902031c6 entered promiscuous mode
[Tue Jan 26 15:44:53 2021] cbr0: port 3(veth902031c6) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 3(veth902031c6) entered forwarding state
[Wed Jan 27 07:45:00 2021] python3 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=609
[Wed Jan 27 07:45:00 2021] python3 cpuset=e7bf5c5a2cb0af87a765c718705dfe197f0e462e955f64e380511e1a6101b6b4 mems_allowed=0
[Wed Jan 27 07:45:00 2021] CPU: 1 PID: 353763 Comm: python3 Not tainted 4.19.150+ #1
[Wed Jan 27 07:45:00 2021] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[Wed Jan 27 07:45:00 2021] Call Trace:
[Wed Jan 27 07:45:00 2021] dump_stack+0x61/0x96
[Wed Jan 27 07:45:00 2021] dump_header+0x76/0x3a0
[Wed Jan 27 07:45:00 2021] oom_kill_process+0xb1/0x280
[Wed Jan 27 07:45:00 2021] out_of_memory+0x30a/0x4b0
[Wed Jan 27 07:45:00 2021] try_charge+0x6b8/0x9c0
[Wed Jan 27 07:45:00 2021] mem_cgroup_try_charge+0x1d7/0x220
[Wed Jan 27 07:45:00 2021] mem_cgroup_try_charge_delay+0x1e/0x40
[Wed Jan 27 07:45:00 2021] handle_mm_fault+0xeeb/0x1640
[Wed Jan 27 07:45:00 2021] __do_page_fault+0x25f/0x480
[Wed Jan 27 07:45:00 2021] ? page_fault+0x8/0x30
[Wed Jan 27 07:45:00 2021] page_fault+0x1e/0x30
[Wed Jan 27 07:45:00 2021] RIP: 0033:0x7f8ab337dd16
[Wed Jan 27 07:45:00 2021] Code: 8e c0 01 00 00 c5 fe 6f 06 c5 fe 6f 4e 20 c5 fe 6f 56 40 c5 fe 6f 5e 60 48 81 c6 80 00 00 00 48 81 ea 80 00 00 00 c5 fd e7 07 <c5> fd e7 4f 20 c5 fd e7 57 40 c5 fd e7 5f 60 48 81 c7 80 00 00 00
[Wed Jan 27 07:45:00 2021] RSP: 002b:00007f88b2ff3568 EFLAGS: 00010202
[Wed Jan 27 07:45:00 2021] RAX: 00007f88976de058 RBX: 00000000000000e4 RCX: 00007f889c0af12c
[Wed Jan 27 07:45:00 2021] RDX: 000000000060c0ec RSI: 00007f88a05b8048 RDI: 00007f889baa2fe0
[Wed Jan 27 07:45:00 2021] RBP: 00007f889c1f3040 R08: fffffffffffffff8 R09: 0000000000000000
[Wed Jan 27 07:45:00 2021] R10: 00007f889c0af14c R11: 00007f88976de058 R12: 0000000000000000
[Wed Jan 27 07:45:00 2021] R13: 00007f88976de058 R14: 00000000000000a4 R15: 312f67726f2e3377
[Wed Jan 27 07:45:00 2021] Task in /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3/e7bf5c5a2cb0af87a765c718705dfe197f0e462e955f64e380511e1a6101b6b4 killed as a result of limit of /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3
[Wed Jan 27 07:45:00 2021] memory: usage 10485700kB, limit 10485760kB, failcnt 123
[Wed Jan 27 07:45:00 2021] memory+swap: usage 10485700kB, limit 9007199254740988kB, failcnt 0
[Wed Jan 27 07:45:00 2021] kmem: usage 80624kB, limit 9007199254740988kB, failcnt 0
[Wed Jan 27 07:45:00 2021] Memory cgroup stats for /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Wed Jan 27 07:45:00 2021] Memory cgroup stats for /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3/db170fc2e6dfbc965920076d9c1c264724b871b1769612889e6a6687779d8072: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:44KB inactive_file:0KB active_file:0KB unevictable:0KB
[Wed Jan 27 07:45:00 2021] Memory cgroup stats for /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3/e7bf5c5a2cb0af87a765c718705dfe197f0e462e955f64e380511e1a6101b6b4: cache:0KB rss:10404888KB rss_huge:1142784KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:10405016KB inactive_file:0KB active_file:0KB unevictable:0KB
[Wed Jan 27 07:45:00 2021] Tasks state (memory values in pages):
[Wed Jan 27 07:45:00 2021] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Wed Jan 27 07:45:00 2021] [ 2092] 0 2092 256 1 32768 0 -998 pause
[Wed Jan 27 07:45:00 2021] [ 4016] 0 4016 1160 224 61440 0 609 startup.sh
[Wed Jan 27 07:45:00 2021] [ 4049] 0 4049 2458626 1604375 14508032 0 609 python3
[Wed Jan 27 07:45:00 2021] [ 4926] 0 4926 2525494 1667890 15065088 0 609 python3
[Wed Jan 27 07:45:00 2021] [ 4927] 0 4927 2499074 1637148 14647296 0 609 python3
[Wed Jan 27 07:45:00 2021] [ 4928] 0 4928 2503558 1637534 14741504 0 609 python3
[Wed Jan 27 07:45:00 2021] Memory cgroup out of memory: Kill process 4926 (python3) score 1246 or sacrifice child
[Wed Jan 27 07:45:00 2021] Killed process 4926 (python3) total-vm:10101976kB, anon-rss:6560464kB, file-rss:111096kB, shmem-rss:0kB
[Wed Jan 27 07:45:00 2021] oom_reaper: reaped process 4926 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
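Does this seem relevant?
I checked the NAT gateway to see whether my nodes have enough ports. I have 2 IP addresses and about 50 nodes. The minimum number of ports per VM is 128 (I tried increasing it to 1024, but that did not solve the issue), which should be enough for roughly 1,000 instances. So this does not seem to be the problem here.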
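As a rough check of that port arithmetic, here is a minimal sketch. It assumes the figure documented for Cloud NAT of 64,512 usable source ports per NAT IP address (65,536 minus the 1,024 well-known ports) and static port allocation; the class and method names are hypothetical and only for illustration.

```csharp
using System;

public static class NatPortBudget
{
    // Assumption: Cloud NAT exposes 64,512 usable source ports per NAT IP
    // (65,536 minus the first 1,024 well-known ports), per protocol.
    public const int PortsPerNatIp = 64512;

    // Number of VMs that can each be given the configured minimum number of ports.
    public static int MaxVms(int natIps, int minPortsPerVm)
    {
        return natIps * PortsPerNatIp / minPortsPerVm;
    }

    public static void Main()
    {
        Console.WriteLine(MaxVms(2, 128));   // prints 1008
        Console.WriteLine(MaxVms(2, 1024));  // prints 126
    }
}
```

Both results are above the roughly 50 nodes in the cluster, so the gateway has enough ports to hand out in either configuration. This only covers how many VMs can receive an allocation; it says nothing about whether a single node's pods exhaust that node's share under load.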
Questions
How can I solve this issue?
How can I investigate it further?
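We found that the problem occurred because we were using C#'s built-in ManualResetEvent from async code. It seems to have caused some kind of deadlock on the application's threads.
Switching to SemaphoreSlim solved the issue.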
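To illustrate the kind of change described, here is a minimal sketch, not our actual code. ManualResetEvent.WaitOne() blocks the calling thread, and when that thread comes from the thread pool, enough blocked waiters under load can plausibly leave no threads free to complete pending I/O, which would match the timeouts above. SemaphoreSlim.WaitAsync() returns a Task that can be awaited without holding a thread. The class name, the counter, and the 10 ms delay are hypothetical stand-ins.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncGateDemo
{
    // Hypothetical gate: at most one caller inside the critical section at a time.
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(1, 1);
    private static int _counter;

    public static async Task DoWorkAsync()
    {
        // Awaiting WaitAsync yields the thread back to the pool while waiting,
        // unlike ManualResetEvent.WaitOne(), which would block it.
        await Gate.WaitAsync().ConfigureAwait(false);
        try
        {
            int seen = _counter;
            await Task.Delay(10).ConfigureAwait(false); // stand-in for real async I/O
            _counter = seen + 1;
        }
        finally
        {
            // Always release, even if the work throws.
            Gate.Release();
        }
    }

    public static async Task<int> RunAsync(int workers)
    {
        _counter = 0;
        await Task.WhenAll(Enumerable.Range(0, workers).Select(_ => DoWorkAsync()))
                  .ConfigureAwait(false);
        return _counter;
    }

    public static void Main()
    {
        // The gate serializes every read-modify-write, so no increment is lost.
        Console.WriteLine(RunAsync(20).GetAwaiter().GetResult()); // prints 20
    }
}
```

The try/finally around Release() matters: a gate that is never released after an exception would stall every later caller in much the same way as the original problem.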
at Grpc.Core.Internal.NativeMetadataCredentialsPlugin.GetMetadataAsync(AuthInterceptorContext context, IntPtr callbackPtr, IntPtr userDataPtr)", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1611499756.899018095","description":"Getting metadata from plugin failed with error: Exception occurred in metadata credentials plugin. System.Net.Http.HttpRequestException: The SSL connection could not be established, see inner exception. ---> System.IO.IOException: Authentication failed because the remote party has closed the transport stream.\n at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n at 
System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n at System.Net.Security.SslState.PartialFrameCallback(AsyncProtocolRequest asyncRequest)\n--- End of stack trace from previous location where exception was thrown ---\n at System.Net.Security.SslState.ThrowIfExceptional()\n at System.Net.Security.SslState.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)\n at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)\n at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)\n at System.Net.Security.SslStream.<>c.<AuthenticateAsClientAsync>b__47_1(IAsyncResult iar)\n at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)\n--- End of stack trace from previous location where exception was thrown ---\n at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, 
CancellationToken cancellationToken)\n --- End of inner exception stack trace ---\n at Google.Apis.Http.ConfigurableMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)\n at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)\n at Google.Apis.Auth.OAuth2.Requests.TokenRequestExtenstions.ExecuteAsync(TokenRequest request, HttpClient httpClient, String tokenServerUrl, CancellationToken taskCancellationToken, IClock clock, ILogger logger)\n at Google.Apis.Auth.OAuth2.ServiceAccountCredential.RequestAccessTokenAsync(CancellationToken taskCancellationToken)\n at Google.Apis.Auth.OAuth2.TokenRefreshManager.RefreshTokenAsync()\n at Google.Apis.Auth.OAuth2.TokenRefreshManager.GetAccessTokenForRequestAsync(CancellationToken cancellationToken)\n at Google.Apis.Auth.OAuth2.ServiceAccountCredential.GetAccessTokenForRequestAsync(String authUri, CancellationToken cancellationToken)\n at Google.Apis.Auth.OAuth2.ServiceCredential.GetAccessTokenWithHeadersForRequestAsync(String authUri, CancellationToken cancellationToken)\n at Grpc.Auth.GoogleAuthInterceptors.<>c__DisplayClass3_0.<<FromCredential>b__0>d.MoveNext()\n--- End of stack trace from previous location where exception was thrown ---\n at Grpc.Core.Internal.NativeMetadataCredentialsPlugin.GetMetadataAsync(AuthInterceptorContext context, IntPtr callbackPtr, IntPtr userDataPtr)","file":"/var/local/git/grpc/src/core/lib/security/credentials/plugin/plugin_credentials.cc","file_line":93,"grpc_status":14}")
at Google.Api.Gax.Grpc.ApiCallRetryExtensions.<>c__DisplayClass0_0`2.<<WithRetry>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at MyCode.Utils.LogService.Providers.GoogleStackdriverLogProvider.WriteAsync(IEnumerable`1 entries)
at MyCode.Utils.LogService.Logger.WritePendingEntries()
875 Information Error while calling `UploadGZipObjectAsync`. Retry (00:00:08) taking place. Exception=System.Net.Http.HttpRequestException: The SSL connection could not be established, see inner exception. ---> System.IO.IOException: Authentication failed because the remote party has closed the transport stream.
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslState.PartialFrameCallback(AsyncProtocolRequest asyncRequest)
--- End of stack trace from previous location where exception was thrown ---
at System.Net.Security.SslState.ThrowIfExceptional()
at System.Net.Security.SslState.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)
at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)
at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)
at System.Net.Security.SslStream.<>c.<AuthenticateAsClientAsync>b__47_1(IAsyncResult iar)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at Google.Cloud.Storage.V1.StorageClientImpl.UploadHelper.CheckFinalProgress()
at Google.Cloud.Storage.V1.StorageClientImpl.UploadHelper.ExecuteAsync(CancellationToken cancellationToken)
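There appear to be several connection issues. They seem related to each other, and they occur only temporarily while the load is high.
My investigation
I looked at dmesg on the pods that had connection issues and got this output: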
[Tue Jan 26 15:44:30 2021] systemd-journald[120]: Received request to flush runtime journal from PID 1
[Tue Jan 26 15:44:31 2021] EXT4-fs (sda1): resizing filesystem from 1533435 to 25126395 blocks
[Tue Jan 26 15:44:33 2021] EXT4-fs (sda1): resized filesystem to 25126395
[Tue Jan 26 15:44:40 2021] Bridge firewalling registered
[Tue Jan 26 15:44:40 2021] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[Tue Jan 26 15:44:43 2021] EXT4-fs (sda1): re-mounted. Opts: commit=30
[Tue Jan 26 15:44:43 2021] EXT4-fs (sda1): re-mounted. Opts: commit=30
[Tue Jan 26 15:44:46 2021] EXT4-fs (sda1): re-mounted. Opts: commit=30
[Tue Jan 26 15:44:49 2021] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
[Tue Jan 26 15:44:50 2021] IPVS: Registered protocols (TCP, UDP, SCTP, AH, ESP)
[Tue Jan 26 15:44:50 2021] IPVS: Connection hash table configured (size=4096, memory=64Kbytes)
[Tue Jan 26 15:44:50 2021] IPVS: ipvs loaded.
[Tue Jan 26 15:44:50 2021] IPVS: [rr] scheduler registered.
[Tue Jan 26 15:44:50 2021] IPVS: [wrr] scheduler registered.
[Tue Jan 26 15:44:50 2021] IPVS: [sh] scheduler registered.
[Tue Jan 26 15:44:53 2021] IPv6: ADDRCONF(NETDEV_UP): veth1e78fb72: link is not ready
[Tue Jan 26 15:44:53 2021] IPv6: ADDRCONF(NETDEV_CHANGE): veth1e78fb72: link becomes ready
[Tue Jan 26 15:44:53 2021] cbr0: port 1(veth1e78fb72) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 1(veth1e78fb72) entered disabled state
[Tue Jan 26 15:44:53 2021] device veth1e78fb72 entered promiscuous mode
[Tue Jan 26 15:44:53 2021] cbr0: port 1(veth1e78fb72) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 1(veth1e78fb72) entered forwarding state
[Tue Jan 26 15:44:53 2021] device cbr0 entered promiscuous mode
[Tue Jan 26 15:44:53 2021] cbr0: port 2(veth51bc9563) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 2(veth51bc9563) entered disabled state
[Tue Jan 26 15:44:53 2021] device veth51bc9563 entered promiscuous mode
[Tue Jan 26 15:44:53 2021] cbr0: port 2(veth51bc9563) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 2(veth51bc9563) entered forwarding state
[Tue Jan 26 15:44:53 2021] cbr0: port 3(veth902031c6) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 3(veth902031c6) entered disabled state
[Tue Jan 26 15:44:53 2021] device veth902031c6 entered promiscuous mode
[Tue Jan 26 15:44:53 2021] cbr0: port 3(veth902031c6) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 3(veth902031c6) entered forwarding state
[Wed Jan 27 07:45:00 2021] python3 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=609
[Wed Jan 27 07:45:00 2021] python3 cpuset=e7bf5c5a2cb0af87a765c718705dfe197f0e462e955f64e380511e1a6101b6b4 mems_allowed=0
[Wed Jan 27 07:45:00 2021] CPU: 1 PID: 353763 Comm: python3 Not tainted 4.19.150+ #1
[Wed Jan 27 07:45:00 2021] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[Wed Jan 27 07:45:00 2021] Call Trace:
[Wed Jan 27 07:45:00 2021] dump_stack+0x61/0x96
[Wed Jan 27 07:45:00 2021] dump_header+0x76/0x3a0
[Wed Jan 27 07:45:00 2021] oom_kill_process+0xb1/0x280
[Wed Jan 27 07:45:00 2021] out_of_memory+0x30a/0x4b0
[Wed Jan 27 07:45:00 2021] try_charge+0x6b8/0x9c0
[Wed Jan 27 07:45:00 2021] mem_cgroup_try_charge+0x1d7/0x220
[Wed Jan 27 07:45:00 2021] mem_cgroup_try_charge_delay+0x1e/0x40
[Wed Jan 27 07:45:00 2021] handle_mm_fault+0xeeb/0x1640
[Wed Jan 27 07:45:00 2021] __do_page_fault+0x25f/0x480
[Wed Jan 27 07:45:00 2021] ? page_fault+0x8/0x30
[Wed Jan 27 07:45:00 2021] page_fault+0x1e/0x30
[Wed Jan 27 07:45:00 2021] RIP: 0033:0x7f8ab337dd16
[Wed Jan 27 07:45:00 2021] Code: 8e c0 01 00 00 c5 fe 6f 06 c5 fe 6f 4e 20 c5 fe 6f 56 40 c5 fe 6f 5e 60 48 81 c6 80 00 00 00 48 81 ea 80 00 00 00 c5 fd e7 07 <c5> fd e7 4f 20 c5 fd e7 57 40 c5 fd e7 5f 60 48 81 c7 80 00 00 00
[Wed Jan 27 07:45:00 2021] RSP: 002b:00007f88b2ff3568 EFLAGS: 00010202
[Wed Jan 27 07:45:00 2021] RAX: 00007f88976de058 RBX: 00000000000000e4 RCX: 00007f889c0af12c
[Wed Jan 27 07:45:00 2021] RDX: 000000000060c0ec RSI: 00007f88a05b8048 RDI: 00007f889baa2fe0
[Wed Jan 27 07:45:00 2021] RBP: 00007f889c1f3040 R08: fffffffffffffff8 R09: 0000000000000000
[Wed Jan 27 07:45:00 2021] R10: 00007f889c0af14c R11: 00007f88976de058 R12: 0000000000000000
[Wed Jan 27 07:45:00 2021] R13: 00007f88976de058 R14: 00000000000000a4 R15: 312f67726f2e3377
[Wed Jan 27 07:45:00 2021] Task in /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3/e7bf5c5a2cb0af87a765c718705dfe197f0e462e955f64e380511e1a6101b6b4 killed as a result of limit of /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3
[Wed Jan 27 07:45:00 2021] memory: usage 10485700kB, limit 10485760kB, failcnt 123
[Wed Jan 27 07:45:00 2021] memory+swap: usage 10485700kB, limit 9007199254740988kB, failcnt 0
[Wed Jan 27 07:45:00 2021] kmem: usage 80624kB, limit 9007199254740988kB, failcnt 0
[Wed Jan 27 07:45:00 2021] Memory cgroup stats for /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Wed Jan 27 07:45:00 2021] Memory cgroup stats for /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3/db170fc2e6dfbc965920076d9c1c264724b871b1769612889e6a6687779d8072: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:44KB inactive_file:0KB active_file:0KB unevictable:0KB
[Wed Jan 27 07:45:00 2021] Memory cgroup stats for /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3/e7bf5c5a2cb0af87a765c718705dfe197f0e462e955f64e380511e1a6101b6b4: cache:0KB rss:10404888KB rss_huge:1142784KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:10405016KB inactive_file:0KB active_file:0KB unevictable:0KB
[Wed Jan 27 07:45:00 2021] Tasks state (memory values in pages):
[Wed Jan 27 07:45:00 2021] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Wed Jan 27 07:45:00 2021] [ 2092] 0 2092 256 1 32768 0 -998 pause
[Wed Jan 27 07:45:00 2021] [ 4016] 0 4016 1160 224 61440 0 609 startup.sh
[Wed Jan 27 07:45:00 2021] [ 4049] 0 4049 2458626 1604375 14508032 0 609 python3
[Wed Jan 27 07:45:00 2021] [ 4926] 0 4926 2525494 1667890 15065088 0 609 python3
[Wed Jan 27 07:45:00 2021] [ 4927] 0 4927 2499074 1637148 14647296 0 609 python3
[Wed Jan 27 07:45:00 2021] [ 4928] 0 4928 2503558 1637534 14741504 0 609 python3
[Wed Jan 27 07:45:00 2021] Memory cgroup out of memory: Kill process 4926 (python3) score 1246 or sacrifice child
[Wed Jan 27 07:45:00 2021] Killed process 4926 (python3) total-vm:10101976kB, anon-rss:6560464kB, file-rss:111096kB, shmem-rss:0kB
[Wed Jan 27 07:45:00 2021] oom_reaper: reaped process 4926 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
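Does this seem related?
I also checked the NAT gateway to see whether my nodes have enough ports. I have 2 IP addresses and about 50 nodes. The minimum number of ports per VM is 128 (I tried increasing it to 1024, but that did not solve the problem), which should be enough for 1024 instances. So this does not seem to be the problem here.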
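The port budget above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes each Cloud NAT IP address offers 64,512 usable source ports per protocol (65,536 minus the 1,024 well-known ports), and the class and method names are made up for illustration. It only answers whether the NAT IP addresses can cover every node's minimum allocation. It does not show whether a single busy node runs out of its own ports.

```csharp
using System;

// Hypothetical helper, not part of any Google library.
public static class NatPortBudget
{
    // Assumption: usable source ports per NAT IP address and protocol.
    public const int UsablePortsPerNatIp = 64512;

    // Number of VMs that can each receive minPortsPerVm ports.
    public static int MaxVms(int natIps, int minPortsPerVm)
    {
        return natIps * UsablePortsPerNatIp / minPortsPerVm;
    }

    public static void Main()
    {
        Console.WriteLine(MaxVms(2, 128));   // prints 1008
        Console.WriteLine(MaxVms(2, 1024));  // prints 126
    }
}
```

Under that assumption, both settings leave room for about 50 nodes, which is consistent with the conclusion that the total number of NAT ports is not the bottleneck.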
Questions
How can I solve this issue?
How can I investigate it further?
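We found that the problem occurred because we were using C#'s built-in ManualResetEvent from async code. It seems to have caused some kind of deadlock among the application's threads. Switching to SemaphoreSlim solved the problem.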
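The original code was not posted, so the following is only a minimal sketch of the general pattern, assuming the event was used to guard a section of async code. ManualResetEvent.WaitOne() blocks the calling thread, and under high load this may tie up thread-pool threads that pending continuations (such as a TLS handshake or a database connect) need in order to finish. SemaphoreSlim.WaitAsync() lets the caller await instead of blocking. The names GuardedCounter and SemaphoreSlimDemo are hypothetical.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the shared resource that was being guarded.
public sealed class GuardedCounter
{
    // One permit: at most one caller is inside the guarded section at a time.
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private int _value;

    public int Value { get { return _value; } }

    public async Task IncrementAsync()
    {
        // Awaiting here releases the thread while waiting for the permit,
        // unlike ManualResetEvent.WaitOne(), which blocks the thread.
        await _gate.WaitAsync();
        try
        {
            int current = _value;
            await Task.Delay(1); // simulated I/O inside the guarded section
            _value = current + 1;
        }
        finally
        {
            // Always hand the permit back, even if the work throws.
            _gate.Release();
        }
    }
}

public static class SemaphoreSlimDemo
{
    // Starts the given number of concurrent callers and returns the final count.
    public static async Task<int> RunAsync(int callers)
    {
        var counter = new GuardedCounter();
        await Task.WhenAll(
            Enumerable.Range(0, callers).Select(_ => counter.IncrementAsync()));
        return counter.Value;
    }

    public static async Task Main()
    {
        Console.WriteLine(await RunAsync(100)); // prints 100
    }
}
```

SemaphoreSlim is not a drop-in replacement for every use of ManualResetEvent, since an event signals all waiters while a semaphore hands out permits. Whether this shape fits depends on how the event was used in the original code.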