Latency after moving micro-service (using ZeroMQ, C, & Python processes) from 64 bit hardware to 32 bit hardware, but nominal CPU usage

Question

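I have two processes written in C that set up PUSH/PULL ZeroMQ sockets, and two threads in a Python process that mirror the PUSH/PULL sockets. Roughly 80 - 300 lightweight (<30 bytes) messages per second are sent from the C processes to the Python process, and 10-30 similar messages per second from the Python process to the C processes.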

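I was running these services on 64 bit ARMv8 (Ubuntu based) and AMD64 (Ubuntu 18.04) with no noticeable latency. I then ran the exact same services on a 32 bit Linux based system and was shocked to see messages coming through more than 30 seconds late, even after killing the C services. CPU usage was fairly flat at 30-40% and did not appear to be the bottleneck.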

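My ZeroMQ socket settings did not change between systems. I set LINGER to 0, tried RCVTIMEO values between 0 and 100 ms, and tried varying BACKLOG between 0 and 50; none of these made a difference. I tried using multiple IO threads and setting socket thread affinity, also to no avail. The PUSH sockets connect to tcp://localhost:##### and the PULL sockets bind to tcp://*:#####. I also tried ipc:///tmp/...; messages were sent and received, but the latency was still there on the 32 bit system.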

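I investigated the other Python steps between receiving messages, and they do not appear to take more than a millisecond at most. When I time socket.recv(0), it takes as long as 0.02 seconds, even when RCVTIMEO is set to 0 for that socket.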

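Any suggestions as to why I would see this behaviour on the new 32 bit platform and not on the other platforms? Am I possibly looking in all the wrong places?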

Here is a bit of code to help explain.

The connection and the _recv() class methods are roughly as follows:

    def _connect(self):
        self.context = zmq.Context(4)
        self.sink = self.context.socket(zmq.PULL)
        self.sink.setsockopt(zmq.LINGER, 0)
        self.sink.setsockopt(zmq.RCVTIMEO, 100)
        self.sink.setsockopt(zmq.BACKLOG, 0)
        self.sink.bind("tcp://*:55755")

    def _recv(self):
        while True:
            msg = None
            try:
                msg = self.sink.recv(0)  # Use blocking or zmq.NOBLOCK, still appears to be slow
            except zmq.ZMQError:
                pass  # ... meaningful exception handling here

            # This last step, when timed usually takes less than a millisecond to process
            if msg:
                msg_dict = utils.bytestream_to_dict(msg)  # unpacking step (negligible)
                if msg_dict:
                    self.parser.parse(msg_dict)  # parser is a dict of callbacks also negligible

On the C process side:

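    /* zmq_init() is deprecated and returned a separate context that was never used;
       set the I/O thread count on the context that is actually used instead */
    void *context = zmq_ctx_new ();
    zmq_ctx_set (context, ZMQ_IO_THREADS, 4);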

    /* Connect the Sender */
    void *vent = zmq_socket (context, ZMQ_PUSH);

    int timeo = 0;
    int timeo_ret = zmq_setsockopt(vent, ZMQ_SNDTIMEO, &timeo, sizeof(timeo));
    if (timeo_ret != 0)
        error("Failed to set ZMQ send timeout because %s", zmq_strerror(errno));

    int linger = 100;
    int linger_ret = zmq_setsockopt(vent, ZMQ_LINGER, &linger, sizeof(linger));
    if (linger_ret != 0)
        error("Failed to set ZMQ linger because %s", zmq_strerror(errno));

    if (zmq_connect (vent, vent_port) == 0)
        info("Successfully initialized ZeroMQ ventilator on %s", vent_port);
    else {
        error("Failed to initialize %s ZeroMQ ventilator with error %s", vent_port,
               zmq_strerror(errno));
        ret = 1;
    }

    ...

    /* When a message needs to be sent it's instantly hitting this where msg is a char* */
    ret = zmq_send(vent, msg, msg_len, ZMQ_NOBLOCK);

Output of lstopo - -v --no-io, run in Docker on the target 32 bit system:

Machine (P#0 local=1019216KB total=1019216KB HardwareName="Freescale i.MX6 Quad/DualLite (Device Tree)" HardwareRevision=0000 HardwareSerial=0000000000000000 Backend=Linux LinuxCgroup=/docker/d2b0a3b3a5eedb7e10fc89fdee6e8493716a359597ac61350801cc302d79b8c0 OSName=Linux OSRelease=3.10.54-dey+g441c8d4 OSVersion="#1 SMP PREEMPT RT Tue Jan 28 12:11:37 CST 2020" HostName=db1docker Architecture=armv7l hwlocVersion=1.11.12 ProcessName=lstopo)
  Package L#0 (P#0 CPUModel="ARMv7 Processor rev 10 (v7l)" CPUImplementer=0x41 CPUArchitecture=7 CPUVariant=0x2 CPUPart=0xc09 CPURevision=10)
    Core L#0 (P#0)
      PU L#0 (P#0)
    Core L#1 (P#1)
      PU L#1 (P#1)
    Core L#2 (P#2)
      PU L#2 (P#2)
    Core L#3 (P#3)
      PU L#3 (P#3)
depth 0:        1 Machine (type #1)
 depth 1:       1 Package (type #3)
  depth 2:      4 Core (type #5)
   depth 3:     4 PU (type #6)

Edit:

We were able to make the latency on the target machine disappear by disabling nearly all of the other worker threads.

Answer 1

Q : roughly 80 - 300 light weight (<30 bytes) messages per second being sent from the C process to the Python process, and 10-30 similar messages from the Python process to the C process.

a) There is zero information about any messages being sent from Python to C (this is not in the posted source code, where only the C side PUSH-es to Python).

b) 300 [Hz] of < 30 B payloads is nothing in terms of ZeroMQ capabilities.

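c) Python is, has always been (and almost certainly will remain) pure-[SERIAL], in the sense that no matter how many Thread instances exist, any execution has to wait until it gets POSACK'ed ownership of the GIL lock. Holding the lock blocks all other work, which re-[SERIAL]-ises the threads into a pure one-after-another sequence, with the add-on cost of the GIL lock handshaking on top.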

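d) Given that all processes run on the same hardware platform (see the tcp://localhost... specified), there is no reason to spawn as many as (4 + 4) IO threads. Python cannot "harness" them, as it reads on just one thread at a time (in slow motion), and only 4 CPU cores are available, as reported by the lstopo excerpt above: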

Machine (995MB)
+Package L#0
 Core L#0 +PU L#0 (P#0)
 Core L#1 +PU L#1 (P#1)
 Core L#2 +PU L#2 (P#2)
 Core L#3 +PU L#3 (P#3)

e) Tuning ISO-OSI-L2/L3 parameters makes sense, but only after all the larger sources of latency have been shaved off.

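f) Last but not least, run the Python pystone test (on both the original platforms and the target hardware platform) to see the actual relative performance of the i.MX6-CPU-powered Python, so as to be able to compare apples to apples.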

Running pystone on the target machine results in: This machine benchmarks at 10188.5 pystones/second and my host machine is 274264 pystones/second
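The two pystone figures quoted here, together with the worst-case 0.02 s socket.recv() timing from the question, are enough for a rough back-of-the-envelope check. The sketch below is only an illustration: the function name backlog_delay, the simple fluid queue model, and the assumption that every receive costs the worst-case 0.02 s are not from the original post.

```python
# Figures quoted in the post
TARGET_PYSTONES = 10188.5   # i.MX6 target, pystones/second
HOST_PYSTONES = 274264.0    # host machine, pystones/second

slowdown = HOST_PYSTONES / TARGET_PYSTONES


def backlog_delay(arrival_hz, service_s, elapsed_s):
    """Estimate how far behind a consumer is after elapsed_s seconds.

    Hypothetical fluid model: messages arrive at arrival_hz and each one
    takes service_s seconds to consume. The queue grows only when the
    arrival rate exceeds the consumer's capacity (1 / service_s).
    """
    capacity_hz = 1.0 / service_s
    growth_hz = max(0.0, arrival_hz - capacity_hz)  # messages queued per second
    queued = growth_hz * elapsed_s                  # messages waiting
    return queued * service_s                       # seconds needed to drain them


# Assumption: every receive costs the worst case reported in the question
RECV_WORST_S = 0.02

if __name__ == "__main__":
    print(f"host/target slowdown: {slowdown:.1f}x")
    for rate in (80, 300):
        delay = backlog_delay(rate, RECV_WORST_S, elapsed_s=60)
        print(f"{rate} msg/s sustained for 60 s -> about {delay:.0f} s behind")
```

Under these assumptions a consumer that needs 0.02 s per message handles at most 50 messages per second, so even the lower bound of 80 messages per second builds a backlog of tens of seconds within a minute, which is the same order of magnitude as the delay described in the question.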

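So the problem with deploying onto the i.MX6 target is not just its 32 bit O/S per se, but also its roughly 27x slower processing. Over-subscribing IO threads (4 + 4 threads, more than the 4 CPU cores) does not improve the message flow.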
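Point c) and the over-subscription remark can be checked with a small experiment. The following is a minimal sketch, assuming a standard CPython build with the GIL enabled (free-threaded builds behave differently); the functions busy, run_serial, and run_threaded are illustrative names, and the CPU-bound loop only stands in for per-message Python work.

```python
import threading
import time


def busy(n):
    """CPU-bound loop standing in for per-message Python work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_serial(jobs, n):
    """Run the jobs one after another on the calling thread."""
    start = time.perf_counter()
    results = [busy(n) for _ in range(jobs)]
    return results, time.perf_counter() - start


def run_threaded(jobs, n):
    """Run the same jobs on one thread each and wait for all of them."""
    results = [None] * jobs

    def worker(idx):
        results[idx] = busy(n)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(jobs)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start


if __name__ == "__main__":
    _, serial_s = run_serial(jobs=4, n=500_000)
    _, threaded_s = run_threaded(jobs=4, n=500_000)
    print(f"serial:   {serial_s:.3f} s")
    print(f"threaded: {threaded_s:.3f} s")
```

On a GIL-enabled interpreter the threaded run is not expected to be meaningfully faster than the serial one, which is why adding more Python threads does not help a receiver that is already CPU-limited. Running it on both the host and the target gives a second, pystone-independent comparison.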

A better view is provided by lstopo-no-graphics -.ascii

    ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
    │ Machine (31876MB)                                                                                                 │
    │                                                                                                                   │
    │ ┌────────────────────────────────────────────────────────────┐                      ┌───────────────────────────┐ │
    │ │ Package P#0                                                │  ├┤╶─┬─────┼┤╶───────┤ PCI 10ae:1F44             │ │
    │ │                                                            │      │               │                           │ │
    │ │ ┌────────────────────────────────────────────────────────┐ │      │               │ ┌────────────┐  ┌───────┐ │ │
    │ │ │ L3 (8192KB)                                            │ │      │               │ │ renderD128 │  │ card0 │ │ │
    │ │ └────────────────────────────────────────────────────────┘ │      │               │ └────────────┘  └───────┘ │ │
    │ │                                                            │      │               │                           │ │
    │ │ ┌──────────────────────────┐  ┌──────────────────────────┐ │      │               │ ┌────────────┐            │ │
    │ │ │ L2 (2048KB)              │  │ L2 (2048KB)              │ │      │               │ │ controlD64 │            │ │
    │ │ └──────────────────────────┘  └──────────────────────────┘ │      │               │ └────────────┘            │ │
    │ │                                                            │      │               └───────────────────────────┘ │
    │ │ ┌──────────────────────────┐  ┌──────────────────────────┐ │      │                                             │
    │ │ │ L1i (64KB)               │  │ L1i (64KB)               │ │      │               ┌───────────────┐             │
    │ │ └──────────────────────────┘  └──────────────────────────┘ │      ├─────┼┤╶───────┤ PCI 10bc:8268 │             │
    │ │                                                            │      │               │               │             │
    │ │ ┌────────────┐┌────────────┐  ┌────────────┐┌────────────┐ │      │               │ ┌────────┐    │             │
    │ │ │ L1d (16KB) ││ L1d (16KB) │  │ L1d (16KB) ││ L1d (16KB) │ │      │               │ │ enp2s0 │    │             │
    │ │ └────────────┘└────────────┘  └────────────┘└────────────┘ │      │               │ └────────┘    │             │
    │ │                                                            │      │               └───────────────┘             │
    │ │ ┌────────────┐┌────────────┐  ┌────────────┐┌────────────┐ │      │                                             │
    │ │ │ Core P#0   ││ Core P#1   │  │ Core P#2   ││ Core P#3   │ │      │     ┌──────────────────┐                    │
    │ │ │            ││            │  │            ││            │ │      ├─────┤ PCI 1002:4790    │                    │
    │ │ │ ┌────────┐ ││ ┌────────┐ │  │ ┌────────┐ ││ ┌────────┐ │ │      │     │                  │                    │
    │ │ │ │ PU P#0 │ ││ │ PU P#1 │ │  │ │ PU P#2 │ ││ │ PU P#3 │ │ │      │     │ ┌─────┐  ┌─────┐ │                    │
    │ │ │ └────────┘ ││ └────────┘ │  │ └────────┘ ││ └────────┘ │ │      │     │ │ sr0 │  │ sda │ │                    │
    │ │ └────────────┘└────────────┘  └────────────┘└────────────┘ │      │     │ └─────┘  └─────┘ │                    │
    │ └────────────────────────────────────────────────────────────┘      │     └──────────────────┘                    │
    │                                                                     │                                             │
    │                                                                     │     ┌───────────────┐                       │
    │                                                                     └─────┤ PCI 1002:479c │                       │
    │                                                                           └───────────────┘                       │
    └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
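Besides lstopo, the thread-to-core ratio mentioned in point d) can be checked from inside the Python process itself. This is a rough sketch using only the standard library; the function names and the thread counts passed in the example are illustrative assumptions (4 + 4 IO threads as configured in the post, plus the two Python receiver threads mentioned in the question), not measured values.

```python
import os


def usable_cpus():
    """Number of CPUs this process may be scheduled on.

    os.sched_getaffinity() honours CPU affinity masks on Linux;
    os.cpu_count() is the fallback on platforms that lack it.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def oversubscription(thread_counts, cpus=None):
    """Return (total_threads, cpus, threads_per_cpu) for a dict of counts."""
    if cpus is None:
        cpus = usable_cpus()
    total = sum(thread_counts.values())
    return total, cpus, total / cpus


if __name__ == "__main__":
    # Hypothetical inventory based on the settings shown in the post
    counts = {"python zmq io": 4, "c zmq io": 4, "python receivers": 2}
    total, cpus, ratio = oversubscription(counts, cpus=4)  # 4 cores per lstopo
    print(f"{total} threads on {cpus} CPUs -> {ratio:.1f} threads per CPU")
```

A ratio well above 1 does not prove that the threads are the bottleneck, since idle IO threads cost little, but it is a cheap first check before reducing the IO thread count to 1 per context.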

Tags: c, sockets, zeromq, 32bit-64bit, python-3.x