Why do my TCP clients fail to connect to my server with a socket exception when stress-testing it with many connections

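While stress-testing my TCP server with a large number of connections, I noticed that connection requests start throwing a SocketException after a while. The exception randomly shows either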

Only one usage of each socket address (protocol/network address/port) is normally permitted.

No connection could be made because the target machine actively refused it.

as its message.

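This usually happens at a random point after a few seconds, once tens of thousands of connections have been opened and closed. To connect, I use the local endpoint IPEndPoint clientEndPoint = new(IPAddress.Any, 0);, which I believe gives me the next free ephemeral port.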

To isolate the issue, I wrote this simple program that runs both a TCP server and many parallel clients for a simple counter:

using System.Diagnostics;
using System.Net;
using System.Net.Sockets;

CancellationTokenSource cancellationTokenSource = new();
CancellationToken cancellationToken = cancellationTokenSource.Token;

const int serverPort = 65000;
const int counterRequestMessage = -1;
const int randomCounterResponseMinDelay = 10; //ms
const int randomCounterResponseMaxDelay = 1000; //ms
const int maxParallelCounterRequests = 10000;

#region server

int counterValue = 0;

async void RunCounterServer()
{
    TcpListener listener = new(IPAddress.Any, serverPort);
    listener.Start(maxParallelCounterRequests);
    while (!cancellationToken.IsCancellationRequested)
    {
        HandleCounterRequester(await listener.AcceptTcpClientAsync(cancellationToken));
    }

    listener.Stop();
}

async void HandleCounterRequester(TcpClient client)
{
    await using NetworkStream stream = client.GetStream();
    Memory<byte> memory = new byte[sizeof(int)];

    //read requestMessage
    await stream.ReadAsync(memory, cancellationToken);
    int requestMessage = BitConverter.ToInt32(memory.Span);
    Debug.Assert(requestMessage == counterRequestMessage);

    //increment counter
    int updatedCounterValue = Interlocked.Add(ref counterValue, 1);
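    //encode outside the assert: Debug.Assert calls are removed from Release builds
    bool isResponseEncoded = BitConverter.TryWriteBytes(memory.Span, updatedCounterValue);
    Debug.Assert(isResponseEncoded);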

    //wait random timeout
    await Task.Delay(GetRandomCounterResponseDelay());

    //write back response
    await stream.WriteAsync(memory, cancellationToken);

    client.Close();
    client.Dispose();
}

int GetRandomCounterResponseDelay()
{
    return Random.Shared.Next(randomCounterResponseMinDelay, randomCounterResponseMaxDelay);
}

RunCounterServer();

#endregion

IPEndPoint clientEndPoint = new(IPAddress.Any, 0);
IPEndPoint serverEndPoint = new(IPAddress.Parse("127.0.0.1"), serverPort);
ReaderWriterLockSlim isExceptionEncounteredLock = new(LockRecursionPolicy.NoRecursion);
bool isExceptionEncountered = false;

async Task RunCounterClient()
{
    try
    {
        int counterResponse;
        using (TcpClient client = new(clientEndPoint))
        {
            await client.ConnectAsync(serverEndPoint, cancellationToken);

            await using (NetworkStream stream = client.GetStream())
            {
                Memory<byte> memory = new byte[sizeof(int)];

                //send counter request
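                //encode outside the assert: Debug.Assert calls are removed from Release builds
                bool isRequestEncoded = BitConverter.TryWriteBytes(memory.Span, counterRequestMessage);
                Debug.Assert(isRequestEncoded);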
                await stream.WriteAsync(memory, cancellationToken);

                //read counter response
                await stream.ReadAsync(memory, cancellationToken);
                counterResponse = BitConverter.ToInt32(memory.Span);
            }

            client.Close();
        }

        isExceptionEncounteredLock.EnterReadLock();
        //log response if there was no exception encountered so far
        if (!isExceptionEncountered)
        {
            Console.WriteLine(counterResponse);
        }

        isExceptionEncounteredLock.ExitReadLock();
    }
    catch (SocketException exception)
    {
        bool isFirstEncounteredException = false;

        isExceptionEncounteredLock.EnterWriteLock();

        //log exception and note that one was encountered if it is the first one
        if (!isExceptionEncountered)
        {
            Console.WriteLine(exception.Message);
            isExceptionEncountered = true;
            isFirstEncounteredException = true;
        }

        isExceptionEncounteredLock.ExitWriteLock();

        //if this is the first exception encountered, rethrow it
        if (isFirstEncounteredException)
        {
            throw;
        }
    }
}

async void RunParallelCounterClients()
{
    SemaphoreSlim clientSlotCount = new(maxParallelCounterRequests, maxParallelCounterRequests);

    async void RunCounterClientAndReleaseSlot()
    {
        await RunCounterClient();
        clientSlotCount.Release();
    }

    while (!cancellationToken.IsCancellationRequested)
    {
        await clientSlotCount.WaitAsync(cancellationToken);
        RunCounterClientAndReleaseSlot();
    }
}

RunParallelCounterClients();

while (true)
{
    ConsoleKeyInfo keyInfo = Console.ReadKey(true);
    if (keyInfo.Key == ConsoleKey.Escape)
    {
        cancellationTokenSource.Cancel();
        break;
    }
}

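My initial guess is that I run out of ephemeral ports because I somehow do not free them correctly. When a request is finished, I just Close() and Dispose() my TcpClients in both my client and server code. I thought that would automatically free the port, but when I run netstat -ab in a console, it gives me countless entries like these, even after closing the application: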

TCP    127.0.0.1:65000        kubernetes:59996       TIME_WAIT
TCP    127.0.0.1:65000        kubernetes:59997       TIME_WAIT
TCP    127.0.0.1:65000        kubernetes:59998       TIME_WAIT
TCP    127.0.0.1:65000        kubernetes:59999       TIME_WAIT
TCP    127.0.0.1:65000        kubernetes:60000       TIME_WAIT
TCP    127.0.0.1:65000        kubernetes:60001       TIME_WAIT
TCP    127.0.0.1:65000        kubernetes:60002       TIME_WAIT
TCP    127.0.0.1:65000        kubernetes:60003       TIME_WAIT
TCP    127.0.0.1:65000        kubernetes:60004       TIME_WAIT
TCP    127.0.0.1:65000        kubernetes:60005       TIME_WAIT
TCP    127.0.0.1:65000        kubernetes:60006       TIME_WAIT
TCP    127.0.0.1:65000        kubernetes:60007       TIME_WAIT
TCP    127.0.0.1:65000        kubernetes:60009       TIME_WAIT

Additionally, my PC sometimes stutters a lot after I exit the application. I assume this is due to Windows cleaning up my leaked port usage?

So I wonder, what am I doing wrong here?

Only one usage of each socket address (protocol/network address/port) is normally permitted. ... My initial guess is, that I run out of ephemeral ports because I somehow do not free them correctly.

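TIME_WAIT is a perfectly normal state that every TCP connection enters on the side that actively closes it, i.e. the side that explicitly calls close after sending data or closes implicitly when the application exits. See this diagram (source https://en.wikipedia.org/wiki/File:Tcp_state_diagram_fixed.svg):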

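It takes some time to leave the TIME_WAIT state and enter CLOSED. As long as a connection is in TIME_WAIT, its specific combination of source IP, source port, destination IP, and destination port cannot be used for a new connection. This effectively limits the number of connections that can be made within a given time from one specific IP address to another specific IP address. Note that typical servers do not run into such limits, because they get many connections from different systems and only a few connections from each source IP.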
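To get a feel for this limit, here is a rough back-of-the-envelope sketch. It assumes a dynamic port range of 49152 to 65535 and a TIME_WAIT duration of 120 seconds; both values are assumptions that depend on the OS version and its configuration, and the class and method names are made up for illustration.

```csharp
using System;

static class EphemeralPortBudget
{
    // Number of ports in an inclusive range, e.g. 49152..65535.
    public static int PortCount(int firstPort, int lastPort)
    {
        return lastPort - firstPort + 1;
    }

    // Upper bound on the sustained rate of new connections from one source IP
    // to one destination IP and port, if every closed connection keeps its
    // source port blocked for the whole TIME_WAIT duration.
    public static double MaxSustainedConnectionsPerSecond(int firstPort, int lastPort, TimeSpan timeWait)
    {
        return PortCount(firstPort, lastPort) / timeWait.TotalSeconds;
    }
}
```

With the assumed values, EphemeralPortBudget.MaxSustainedConnectionsPerSecond(49152, 65535, TimeSpan.FromSeconds(120)) yields 16384 / 120, or roughly 136 connections per second. A test that opens and closes connections faster than that for long enough would be expected to exhaust the range, which fits tens of thousands of connections within a few seconds.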

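There is not much that can be done about this except not actively closing the connection. If the other side initiates the close first (sends a FIN) and one then continues the close (acknowledges the FIN and sends one's own FIN), no TIME_WAIT occurs on this side. Of course, in your specific scenario of a single client and a single server, this only shifts the problem to the server.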
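One way to avoid closing so often is to send many requests over a single connection instead of opening one connection per request. The following is only a minimal sketch of that idea, not the code from the question: it assumes the server keeps answering on a connection until the client closes it, and the names PersistentCounterDemo, ServeAsync, ReadExactAsync, and RunAsync are hypothetical.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

static class PersistentCounterDemo
{
    static int counterValue;

    // Fills the whole buffer; returns false if the peer closed the connection first.
    static async Task<bool> ReadExactAsync(NetworkStream stream, byte[] buffer)
    {
        int offset = 0;
        while (offset < buffer.Length)
        {
            int read = await stream.ReadAsync(buffer.AsMemory(offset));
            if (read == 0) return false;
            offset += read;
        }
        return true;
    }

    // Answers requests on one connection until the client closes it.
    static async Task ServeAsync(TcpClient client)
    {
        using (client)
        {
            NetworkStream stream = client.GetStream();
            byte[] buffer = new byte[sizeof(int)];
            while (await ReadExactAsync(stream, buffer))
            {
                int value = Interlocked.Increment(ref counterValue);
                BitConverter.TryWriteBytes(buffer, value);
                await stream.WriteAsync(buffer);
            }
        }
    }

    // Sends requestCount requests over a single connection and returns the responses.
    public static async Task<int[]> RunAsync(int requestCount)
    {
        TcpListener listener = new(IPAddress.Loopback, 0); // port 0: let the OS pick a free port
        listener.Start();
        int port = ((IPEndPoint)listener.LocalEndpoint).Port;
        Task serverTask = Task.Run(async () => await ServeAsync(await listener.AcceptTcpClientAsync()));

        int[] responses = new int[requestCount];
        using (TcpClient client = new())
        {
            await client.ConnectAsync(IPAddress.Loopback, port);
            NetworkStream stream = client.GetStream();
            byte[] buffer = new byte[sizeof(int)];
            for (int i = 0; i < requestCount; i++)
            {
                BitConverter.TryWriteBytes(buffer, -1); // same request marker as in the question
                await stream.WriteAsync(buffer);
                if (!await ReadExactAsync(stream, buffer))
                    throw new IOException("Server closed the connection early.");
                responses[i] = BitConverter.ToInt32(buffer);
            }
        }

        await serverTask; // the server loop ends once the client has closed
        listener.Stop();
        return responses;
    }
}
```

In this sketch, however many requests are sent, only one connection is closed at the end, so at most one socket ends up in TIME_WAIT rather than one per request.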

No connection could be made because the target machine actively refused it.

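This one has a different cause. The server calls listen on the socket and passes the intended size of the backlog (the OS might not use exactly this value). The OS kernel uses this backlog to hold newly accepted TCP connections, and the server then calls accept to pick them up. If the server calls accept less often than new connections are established, the backlog fills up. Once the backlog is full, the server refuses new connections, which results in the error you see. In other words: this happens when the server cannot keep up with the clients.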
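Since the two messages have different causes, it can help to tell them apart by error code rather than by message text when logging. A small sketch follows; the class and method names are hypothetical, and the returned explanations only summarize the reasoning above rather than anything the runtime reports.

```csharp
using System.Net.Sockets;

static class ConnectFailure
{
    // Maps the two error codes discussed above to their likely cause.
    public static string Explain(SocketError error) => error switch
    {
        // "Only one usage of each socket address ... is normally permitted."
        SocketError.AddressAlreadyInUse =>
            "local ports exhausted, likely held in TIME_WAIT by earlier connections",
        // "No connection could be made because the target machine actively refused it."
        SocketError.ConnectionRefused =>
            "server refused the connection, likely because its listen backlog is full",
        _ => "other socket error: " + error,
    };
}
```

In the question's catch (SocketException exception) block, this could be called as ConnectFailure.Explain(exception.SocketErrorCode).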