Sending a SIGCONT fails silently with unpredictable behavior - linux

I'm studying processes and signals on Linux, and here is a simple test I wrote in C:

#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <mqueue.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h> // int32_t, used in work()

void work(void);

int main(void) {
    pid_t children[10];
    for(size_t i = 0; i < 10; i++) {
        pid_t pid = fork();
        if(pid == -1) {
            perror("parent: error forking");
            return EXIT_FAILURE;
        }
        if(pid == 0) {
            raise(SIGSTOP); // child stops itself
            work(); // after resuming it goes on to execute work()
            return EXIT_SUCCESS; // and finally, it successfully terminates
        } else {
            fprintf(stdout, "parent: spawned child (%d)\n", pid);
            children[i] = pid;
        }
    }

    // parent spawned all 10 children who are now stopped - begin resuming them one by one
    for(size_t i = 0; i < 10; i++) {
        fprintf(stdout, "parent: signaling child (%d) to continue...\n", children[i]);
        if(kill(children[i], SIGCONT) == -1) {
            fprintf(stderr, "parent: error signalling child (%d) to continue: %s\n", children[i], strerror(errno));
        }
    }

    return EXIT_SUCCESS; // exit from parent once all children have been resumed
}

void work(void) {
    pid_t mypid = getpid();
    srand(mypid);
    int32_t sleep_time = (rand() % 10) + 1;
    fprintf(stdout, "(%d): began sleeping for %d seconds\n", mypid, sleep_time);
    sleep(sleep_time);
    fprintf(stdout, "(%d): done sleeping after %d seconds\n", mypid, sleep_time);
}

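The idea is as follows:


The parent spawns 10 children, and each child sends itself a SIGSTOP immediately after being spawned. Once the parent has successfully spawned all 10 processes, it immediately starts sending SIGCONT to each of them.

Once a child is resumed, it goes on to execute work() (which just sleeps for a random time between 1 and 10 seconds while printing informational messages to stdout), after which it terminates successfully.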


This is what a successful run looks like:

[I] bogdan in ~/dev/mserve
>>  ./prog
parent: spawned child (138655)
parent: spawned child (138656)
parent: spawned child (138657)
parent: spawned child (138658)
parent: spawned child (138659)
parent: spawned child (138660)
parent: spawned child (138661)
parent: spawned child (138662)
parent: spawned child (138663)
parent: spawned child (138664)
parent: signaling child (138655) to continue...
(138655): began sleeping for 9 seconds
parent: signaling child (138656) to continue...
(138656): began sleeping for 3 seconds
parent: signaling child (138657) to continue...
parent: signaling child (138658) to continue...
parent: signaling child (138659) to continue...
parent: signaling child (138660) to continue...
parent: signaling child (138661) to continue...
parent: signaling child (138662) to continue...
parent: signaling child (138663) to continue...
parent: signaling child (138664) to continue...
(138659): began sleeping for 4 seconds
(138657): began sleeping for 5 seconds
(138658): began sleeping for 7 seconds
(138660): began sleeping for 10 seconds
(138663): began sleeping for 3 seconds
(138662): began sleeping for 7 seconds
(138664): began sleeping for 7 seconds
(138661): began sleeping for 2 seconds
[I] bogdan in ~/dev/mserve
(138661): done sleeping after 2 seconds
(138656): done sleeping after 3 seconds
(138663): done sleeping after 3 seconds
(138659): done sleeping after 4 seconds
(138657): done sleeping after 5 seconds
(138658): done sleeping after 7 seconds
(138662): done sleeping after 7 seconds
(138664): done sleeping after 7 seconds
(138655): done sleeping after 9 seconds
(138660): done sleeping after 10 seconds

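As the informational messages show, all 10 processes finished sleeping and terminated.

The problem

Roughly one run in three, a random number of the 10 children get "stuck" and never resume after the SIGSTOP. The kill(2) call from the parent that sends SIGCONT succeeds, but the process stays stopped.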

The output looks like this:

[I] bogdan in ~/dev/mserve
>  ./alt
parent: spawned child (139369)
parent: spawned child (139370)
parent: spawned child (139371)
parent: spawned child (139372)
parent: spawned child (139373)
parent: spawned child (139374)
parent: spawned child (139375)
parent: spawned child (139376)
parent: spawned child (139377)
parent: spawned child (139378)
parent: signaling child (139369) to continue...
parent: signaling child (139370) to continue...
parent: signaling child (139371) to continue...
parent: signaling child (139372) to continue...
parent: signaling child (139373) to continue...
parent: signaling child (139374) to continue...
parent: signaling child (139375) to continue...
parent: signaling child (139376) to continue...
parent: signaling child (139377) to continue...
parent: signaling child (139378) to continue...
(139371): began sleeping for 4 seconds
(139369): began sleeping for 8 seconds
(139373): began sleeping for 7 seconds
(139370): began sleeping for 3 seconds
(139375): began sleeping for 9 seconds
(139372): began sleeping for 7 seconds
(139374): began sleeping for 10 seconds
(139376): began sleeping for 8 seconds
(139377): began sleeping for 7 seconds
[I] bogdan in ~/dev/mserve
(139370): done sleeping after 3 seconds
(139371): done sleeping after 4 seconds
(139373): done sleeping after 7 seconds
(139372): done sleeping after 7 seconds
(139377): done sleeping after 7 seconds
(139369): done sleeping after 8 seconds
(139376): done sleeping after 8 seconds
(139375): done sleeping after 9 seconds
(139374): done sleeping after 10 seconds

This time only 9 processes finished successfully (9 "done sleeping" messages were printed).

By running $ ps au in a shell I can observe the "stuck" process (note the T state):

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bogdan    139378  0.0  0.0   2312    80 pts/3    T    20:31   0:00 ./prog

I can even signal it to continue from my shell:

$ kill -SIGCONT 139378
(139378): began sleeping for 5 seconds
...
(139378): done sleeping after 5 seconds

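Another strange detail

When I run the parent under strace (e.g. $ strace ./program), the problem never occurs and all 10 processes resume correctly 100% of the time. I can only observe the problem when I run the parent directly from my shell.

I've gone through the signal(7) man page several times, but I can't figure out why this happens.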

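There appears to be a race condition between the child sending itself SIGSTOP and the parent sending SIGCONT. Sometimes the parent sends SIGCONT before the child has sent itself SIGSTOP. That SIGCONT has no effect, the child then stops itself, and nothing ever resumes it.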

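Most likely, the parent delivers SIGCONT to a child that has not yet stopped itself. Such a signal is ignored, because the process is not stopped at the time the signal is handled.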

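Nothing in your program prevents this from happening. Instead, you are simply relying on the children stopping before the parent gets around to signaling them, which is a race condition. The fact that you can resume a stuck process by sending it an (additional) SIGCONT is consistent with this diagnosis, and it is plausible that strace perturbs the timing enough for the children to always win the race.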