Node sometimes doesn't join the Akka.NET cluster after IIS recycles the AppPool

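We built an Akka.NET cluster infrastructure for SMS, email, and push notifications. The system has three kinds of nodes: client, sender, and lighthouse. The Web application and the API application use the client role (both are hosted on IIS). The Lighthouse and Sender roles are hosted as Windows services. Because IIS recycles the AppPools of the Web and API applications, we shut down the client-role actor system and start it again in the start and stop events in global.asax.cs. The logs show the system shutting down and joining the cluster successfully.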
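
A minimal sketch of the start/stop handling described above, assuming Akka.NET 1.1 or later (for `Terminate()` and `WhenTerminated`). The class layout, the `_system` field, the graceful `Leave` step, and the 10-second timeout are illustrative assumptions, not our production code:

```csharp
using System;
using System.Configuration;
using System.Web;
using Akka.Actor;
using Akka.Configuration.Hocon;

public class Global : HttpApplication
{
    // Hypothetical holder for the client-role actor system.
    private static ActorSystem _system;

    protected void Application_Start(object sender, EventArgs e)
    {
        // Load the <akka><hocon> section from Web.config.
        var section = (AkkaConfigurationSection)ConfigurationManager.GetSection("akka");
        _system = ActorSystem.Create("NotificationSystem", section.AkkaConfig);
    }

    protected void Application_End(object sender, EventArgs e)
    {
        if (_system == null) return;

        var cluster = Akka.Cluster.Cluster.Get(_system);

        // Terminate the actor system once the cluster has removed this node.
        cluster.RegisterOnMemberRemoved(() => _system.Terminate());

        // Ask the cluster to remove this node gracefully.
        cluster.Leave(cluster.SelfAddress);

        // Bound the wait so an AppPool recycle cannot hang (timeout is arbitrary).
        _system.WhenTerminated.Wait(TimeSpan.FromSeconds(10));
    }
}
```

Leaving the cluster before terminating gives the other nodes a chance to remove the old member cleanly, instead of discovering an unreachable node after the recycle.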

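Sometimes, however, when the AppPool recycles, the client ActorSystem starts but fails to join the cluster, and our notifications stop working (which is a huge problem for us). When we manually shut down the ActorSystem and start it again, it joins the cluster. This happens about once every two days.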

We can see that the client joined the cluster before the error:

Node [akka.tcp://NotificationSystem@...:41350] is JOINING, roles [client]
Leader is moving node [akka.tcp://NotificationSystem@...:41350] to [Up]

According to the logs, the following error occurs after the client joins the cluster:

Shut down address: akka.tcp://NotificationSystem@...:41350
Akka.Remote.ShutDownAssociation: Shut down address: akka.tcp://NotificationSystem@...:41350 ---> Akka.Remote.Transport.InvalidAssociationException: The remote system terminated the association because it is shutting down.
   --- End of inner exception stack trace ---
   at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level)
   at Akka.Remote.EndpointWriter.b__20_0(Exception ex)
   at Akka.Actor.LocalOnlyDecider.Decide(Exception cause)
   at Akka.Actor.OneForOneStrategy.Handle(IActorRef child, Exception x)
   at Akka.Actor.SupervisorStrategy.HandleFailure(ActorCell actorCell, Exception cause, ChildRestartStats failedChildStats, IReadOnlyCollection`1 allChildren)
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SystemInvoke(Envelope envelope)
--- End of stack trace from previous location where exception was thrown ---
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SystemInvoke(Envelope envelope)

After that error, we see the following message:

Association to [akka.tcp://NotificationSystem@...:41350] having UID [226948907] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.

The system does not recover on its own unless we restart the client actor system.

Our client role configuration is:

<akka>
<hocon>
    <![CDATA[
        akka{
            loglevel = DEBUG

            actor{
                provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"

                deployment {
                    /coordinatorRouter {
                        router = round-robin-group
                        routees.paths = ["/user/NotificationCoordinator"]
                        cluster {
                                enabled = on
                                max-nr-of-instances-per-node = 1
                                allow-local-routees = off
                                use-role = sender
                        }
                    }                
                }

                serializers {
                    wire = "Akka.Serialization.WireSerializer, Akka.Serialization.Wire"
                }

                serialization-bindings {
                 "System.Object" = wire
                }

                debug{
                    receive = on
                    autoreceive = on
                    lifecycle = on
                    event-stream = on
                    unhandled = on
                }
            }

            remote {
                helios.tcp {
                        transport-class = "Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote"
                        applied-adapters = []
                        transport-protocol = tcp
                        hostname = "***.***.**.**"
                        port = 0
                }
            }

            cluster {
                    seed-nodes = ["akka.tcp://NotificationSystem@***.***.**.**:5053", "akka.tcp://NotificationSystem@***.***.**.**:5073"]
                    roles = [client]
            }
        }
    ]]>
</hocon>
</akka>

Our sender role configuration is:

  <akka>
<hocon><![CDATA[
            akka{
                loglevel = INFO

                loggers = ["Akka.Logger.NLog.NLogLogger, Akka.Logger.NLog"]

                actor{
                    debug {  
                        # receive = on 
                        # autoreceive = on
                        # lifecycle = on
                        # event-stream = on
                        # unhandled = on
                    }         

                    provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"           

                    serializers {
                        wire = "Akka.Serialization.WireSerializer, Akka.Serialization.Wire"
                    }

                    serialization-bindings {
                     "System.Object" = wire
                    }

                    deployment{
                        /NotificationCoordinator/ApplePushNotificationActor{
                            router = round-robin-pool
                            resizer{
                                enabled = on
                                lower-bound = 3
                                upper-bound = 5
                            }
                        }

                        /NotificationCoordinator/AndroidPushNotificationActor{
                            router = round-robin-pool
                            resizer{
                                enabled = on
                                lower-bound = 3
                                upper-bound = 5
                            }
                        }

                        /NotificationCoordinator/EmailActor{
                            router = round-robin-pool
                            resizer{
                                enabled = on
                                lower-bound = 3
                                upper-bound = 5
                            }
                        }

                        /NotificationCoordinator/SmsActor{
                            router = round-robin-pool
                            resizer{
                                enabled = on
                                lower-bound = 3
                                upper-bound = 5
                            }
                        }

                        /NotificationCoordinator/LoggingCoordinator/ResponseLoggerActor{
                            router = round-robin-pool
                            resizer{
                                enabled = on
                                lower-bound = 3
                                upper-bound = 5
                            }
                        }                           
                    }
                }

                remote{
                    log-remote-lifecycle-events = DEBUG
                    log-received-messages = on

                    helios.tcp{
                        transport-class = "Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote"
                        applied-adapters = []
                        transport-protocol = tcp
                        #will be populated with a dynamic host-name at runtime if left uncommented
                        #public-hostname = "POPULATE STATIC IP HERE"
                        hostname = "***.***.**.**"
                        port = 0
                    }
                }

                cluster {
                        seed-nodes = ["akka.tcp://NotificationSystem@***.***.**.**:5053", "akka.tcp://NotificationSystem@***.***.**.**:5073"]
                        roles = [sender]
                }
            }
        ]]></hocon>
</akka>

How can we solve this problem? Thanks.

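This is definitely a bug in the EndpointManager in Akka.Remote. Akka.NET 1.1, due to be released on June 14, should resolve it. We have fixed a large number of cluster rejoin bugs along these lines, but the fixes have not been released yet. Akka.Cluster will be RTM'ed as part of that release.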

In the meantime, if you'd like to try the new bits now, you can use the Akka.NET Nightly Builds.