Sending a SIGCONT fails silently with unpredictable behavior - linux
I'm studying processes and signals on Linux, and below is a simple test I wrote in C:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h> // int32_t
#include <ctype.h>
#include <mqueue.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <stdbool.h>

void work(void);

int main(void) {
    pid_t children[10];
    for(size_t i = 0; i < 10; i++) {
        pid_t pid = fork();
        if(pid == -1) {
            perror("parent: error forking");
            return EXIT_FAILURE;
        }
        if(pid == 0) {
            raise(SIGSTOP); // child stops itself
            work(); // after resuming it goes on to execute work()
            return EXIT_SUCCESS; // and finally, it successfully terminates
        } else {
            fprintf(stdout, "parent: spawned child (%d)\n", pid);
            children[i] = pid;
        }
    }
    // parent spawned all 10 children who are now stopped - begin resuming them one by one
    for(size_t i = 0; i < 10; i++) {
        fprintf(stdout, "parent: signaling child (%d) to continue...\n", children[i]);
        if(kill(children[i], SIGCONT) == -1) {
            fprintf(stderr, "parent: error signalling child (%d) to continue: %s\n", children[i], strerror(errno));
        }
    }
    return EXIT_SUCCESS; // exit from parent once all children have been resumed
}

void work(void) {
    pid_t mypid = getpid();
    srand(mypid); // seed with the PID so each child gets a different sleep time
    int32_t sleep_time = (rand() % 10) + 1; // 1 to 10 seconds
    fprintf(stdout, "(%d): began sleeping for %d seconds\n", mypid, sleep_time);
    sleep(sleep_time);
    fprintf(stdout, "(%d): done sleeping after %d seconds\n", mypid, sleep_time);
}
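The idea is as follows:
The parent spawns 10 children, and each child sends itself a SIGSTOP immediately after being spawned. Once the parent has successfully spawned all 10 children, it immediately starts sending SIGCONT to each of them.
Once a child is resumed, it goes on to execute work() (which simply sleeps for a random time between 1 and 10 seconds while printing informational messages to stdout), after which it terminates successfully.
This is what a successful run looks like: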
[I] bogdan in ~/dev/mserve
>> ./prog
parent: spawned child (138655)
parent: spawned child (138656)
parent: spawned child (138657)
parent: spawned child (138658)
parent: spawned child (138659)
parent: spawned child (138660)
parent: spawned child (138661)
parent: spawned child (138662)
parent: spawned child (138663)
parent: spawned child (138664)
parent: signaling child (138655) to continue...
(138655): began sleeping for 9 seconds
parent: signaling child (138656) to continue...
(138656): began sleeping for 3 seconds
parent: signaling child (138657) to continue...
parent: signaling child (138658) to continue...
parent: signaling child (138659) to continue...
parent: signaling child (138660) to continue...
parent: signaling child (138661) to continue...
parent: signaling child (138662) to continue...
parent: signaling child (138663) to continue...
parent: signaling child (138664) to continue...
(138659): began sleeping for 4 seconds
(138657): began sleeping for 5 seconds
(138658): began sleeping for 7 seconds
(138660): began sleeping for 10 seconds
(138663): began sleeping for 3 seconds
(138662): began sleeping for 7 seconds
(138664): began sleeping for 7 seconds
(138661): began sleeping for 2 seconds
[I] bogdan in ~/dev/mserve
(138661): done sleeping after 2 seconds
(138656): done sleeping after 3 seconds
(138663): done sleeping after 3 seconds
(138659): done sleeping after 4 seconds
(138657): done sleeping after 5 seconds
(138658): done sleeping after 7 seconds
(138662): done sleeping after 7 seconds
(138664): done sleeping after 7 seconds
(138655): done sleeping after 9 seconds
(138660): done sleeping after 10 seconds
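As the informational messages show, all 10 processes finished sleeping and terminated successfully.
The problem
Roughly one run in three, a random number of the 10 children get "stuck" and fail to resume after the SIGSTOP. The kill(2) call from the parent that sends SIGCONT succeeds, but the process remains stopped.
The output looks like this: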
[I] bogdan in ~/dev/mserve
> ./alt
parent: spawned child (139369)
parent: spawned child (139370)
parent: spawned child (139371)
parent: spawned child (139372)
parent: spawned child (139373)
parent: spawned child (139374)
parent: spawned child (139375)
parent: spawned child (139376)
parent: spawned child (139377)
parent: spawned child (139378)
parent: signaling child (139369) to continue...
parent: signaling child (139370) to continue...
parent: signaling child (139371) to continue...
parent: signaling child (139372) to continue...
parent: signaling child (139373) to continue...
parent: signaling child (139374) to continue...
parent: signaling child (139375) to continue...
parent: signaling child (139376) to continue...
parent: signaling child (139377) to continue...
parent: signaling child (139378) to continue...
(139371): began sleeping for 4 seconds
(139369): began sleeping for 8 seconds
(139373): began sleeping for 7 seconds
(139370): began sleeping for 3 seconds
(139375): began sleeping for 9 seconds
(139372): began sleeping for 7 seconds
(139374): began sleeping for 10 seconds
(139376): began sleeping for 8 seconds
(139377): began sleeping for 7 seconds
[I] bogdan in ~/dev/mserve
(139370): done sleeping after 3 seconds
(139371): done sleeping after 4 seconds
(139373): done sleeping after 7 seconds
(139372): done sleeping after 7 seconds
(139377): done sleeping after 7 seconds
(139369): done sleeping after 8 seconds
(139376): done sleeping after 8 seconds
(139375): done sleeping after 9 seconds
(139374): done sleeping after 10 seconds
This time only 9 processes completed successfully (9 "done sleeping" messages were printed).
By running $ ps au in the shell, I can observe the "stuck" process (note the T state):
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
bogdan 139378 0.0 0.0 2312 80 pts/3 T 20:31 0:00 ./prog
I can even signal it from my shell to continue:
$ kill -SIGCONT 139378
(139378): began sleeping for 5 seconds
...
(139378): done sleeping after 5 seconds
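Another odd detail
When the parent is run under strace (e.g. $ strace ./program), the problem never occurs and all 10 processes resume correctly 100% of the time. I can only observe the problem when I run the parent directly from my shell.
I have gone through the signal(7) man page several times, but I can't figure out why this happens.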
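There appears to be a race condition between the child sending itself SIGSTOP and the parent sending SIGCONT. Sometimes the parent sends SIGCONT before the child has sent SIGSTOP. That SIGCONT has no effect, and when the child then stops itself, nothing is left to resume it, so it hangs.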
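One possible way to close this race, sketched here rather than taken from the original post, is to have the parent confirm that each child is actually stopped before resuming it. waitpid(2) with the WUNTRACED flag returns once the child has stopped, so the SIGCONT cannot arrive too early. The helper name spawn_stop_resume and the MAX_CHILDREN limit are made up for illustration, and work() is replaced by an immediate exit to keep the sketch short.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 64 /* arbitrary limit chosen for this sketch */

/* Hypothetical helper: forks n children that stop themselves, waits until
 * each one is really stopped, resumes it, and returns how many children
 * exited successfully (-1 on error). Error paths do not clean up children
 * that were already forked; a real program would. */
int spawn_stop_resume(int n) {
    pid_t children[MAX_CHILDREN];
    assert(n > 0 && n <= MAX_CHILDREN);

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == -1) {
            perror("parent: error forking");
            return -1;
        }
        if (pid == 0) {
            raise(SIGSTOP);      /* child stops itself */
            _exit(EXIT_SUCCESS); /* stand-in for work() */
        }
        children[i] = pid;
    }

    for (int i = 0; i < n; i++) {
        int status;
        /* Block until this child has stopped; only then is SIGCONT useful. */
        if (waitpid(children[i], &status, WUNTRACED) == -1) {
            perror("parent: waitpid");
            return -1;
        }
        if (!WIFSTOPPED(status)) {
            return -1; /* child terminated instead of stopping */
        }
        if (kill(children[i], SIGCONT) == -1) {
            perror("parent: kill");
            return -1;
        }
    }

    /* Reap the children and count the ones that exited with success. */
    int succeeded = 0;
    for (int i = 0; i < n; i++) {
        int status;
        if (waitpid(children[i], &status, 0) == -1) {
            perror("parent: waitpid");
            return -1;
        }
        if (WIFEXITED(status) && WEXITSTATUS(status) == EXIT_SUCCESS) {
            succeeded++;
        }
    }
    return succeeded;
}
```

Calling spawn_stop_resume(10) from main should report 10 if every child stopped, was resumed, and exited. Unlike the original program, this version also waits for the children to terminate instead of exiting right after the last SIGCONT.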
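Most likely, the parent is delivering SIGCONT to a child that has not yet stopped itself. Such a signal has no effect, because the process is not stopped at the time the signal is handled.
Nothing in your program prevents that from happening; instead, you are simply relying on the children to stop before the parent signals them, which is a race condition. The fact that you can resume a stuck process by sending it an (additional) SIGCONT is consistent with this diagnosis, and it is plausible that strace changes the timing enough for the children to always win the race.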