Strange behavior of clone

Question

This is a fairly simple application that creates a lightweight process (thread) with a clone() call.

#define _GNU_SOURCE

#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>

#define STACK_SIZE 1024*1024

int func(void* param) {
    printf("I am func, pid %d\n", getpid());    
    return 0;
}

int main(int argc, char const *argv[]) {
    printf("I am main, pid %d\n", getpid());
    void* ptr = malloc(STACK_SIZE);

    printf("I am calling clone\n");             
    int res = clone(func, ptr + STACK_SIZE, CLONE_VM, NULL);
    // works fine with sleep() call
    // sleep(1);

    if (res == -1) {
        printf("clone error: %d", errno);       
    } else {
        printf("I created child with pid: %d\n", res);      
    }

    printf("Main done, pid %d\n", getpid());        
    return 0;
}

The results are as follows:

Run 1:

➜  LFD401 ./clone
I am main, pid 10974
I am calling clone
I created child with pid: 10975
Main done, pid 10974
I am func, pid 10975

Run 2:

➜  LFD401 ./clone
I am main, pid 10995
I am calling clone
I created child with pid: 10996
I created child with pid: 10996
I am func, pid 10996
Main done, pid 10995

Run 3:

➜  LFD401 ./clone
I am main, pid 11037
I am calling clone
I created child with pid: 11038
I created child with pid: 11038
I am func, pid 11038
I created child with pid: 11038
I am func, pid 11038
Main done, pid 11037

Run 4:

➜  LFD401 ./clone
I am main, pid 11062
I am calling clone
I created child with pid: 11063
Main done, pid 11062
Main done, pid 11062
I am func, pid 11063

What is going on? Why is the "I created child" message sometimes printed several times?

I also noticed that adding a delay after the clone call "fixes" the problem.

Answer 1

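Both of your processes use the same stdout (that is, the same C standard library FILE structure), which includes an accidentally shared buffer. That is undoubtedly causing problems.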
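To make the point about the buffer concrete, here is a minimal sketch showing that text passed to fprintf sits in the FILE's user-space buffer until it is flushed. Because CLONE_VM shares the whole address space, that pending text is visible to both tasks. The function name and the use of a pipe are illustrative assumptions, not part of the original program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns the number of bytes that reach the pipe after fflush(),
 * or -1 on a setup error. Asserts that nothing arrived before the flush. */
int flushed_bytes_after_buffering(void)
{
    int fds[2];
    char tmp[64];

    if (pipe(fds) == -1)
        return -1;
    /* Non-blocking read end, so an empty pipe is detectable. */
    if (fcntl(fds[0], F_SETFL, O_NONBLOCK) == -1)
        return -1;

    FILE *out = fdopen(fds[1], "w");
    if (out == NULL)
        return -1;
    setvbuf(out, NULL, _IOFBF, BUFSIZ);   /* request full buffering explicitly */

    fprintf(out, "I created child\n");
    ssize_t before = read(fds[0], tmp, sizeof tmp);
    assert(before == -1);                 /* text is still in process memory */

    fflush(out);
    ssize_t after = read(fds[0], tmp, sizeof tmp);

    fclose(out);
    close(fds[0]);
    return (int)after;
}
```

When stdout is a terminal it is normally line buffered, so each newline triggers a flush, but the buffer and its bookkeeping pointers still live in memory that both tasks can modify.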

Answer 2

I could not reproduce OP's problem, but I do not think printf is actually the problem.

glibc docs:

The POSIX standard requires that by default the stream operations are atomic. I.e., issuing two stream operations for the same stream in two threads at the same time will cause the operations to be executed as if they were issued sequentially. The buffer operations performed while reading or writing are protected from other uses of the same stream. To do this each stream has an internal lock object which has to be (implicitly) acquired before any work can be done.

Edit:

Although the above is true for threads, as rici points out, there is a comment on sourceware:

Basically, there's nothing you can safely do with CLONE_VM unless the child restricts itself to pure computation and direct syscalls (via sys/syscall.h). If you use any of the standard library, you risk the parent and child clobbering each other's internal states. You also have issues like the fact that glibc caches the pid/tid in userspace, and the fact that glibc expects to always have a valid thread pointer which your call to clone is unable to initialize correctly because it does not know (and should not know) the internal implementation of threads.

Apparently glibc is not designed to work with clone if CLONE_VM is set but CLONE_THREAD|CLONE_SIGHAND are not.
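The restriction described in the sourceware comment can be sketched as follows: the child does nothing except pure computation and raw system calls issued through syscall(). This is only an illustration under stated assumptions. The names run_raw_child, raw_child and child_pid_seen are hypothetical, and SIGCHLD is added to the clone flags (the original program does not do this) so that the parent can reap the child with a plain waitpid().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (1024 * 1024)

static volatile long child_pid_seen;   /* shared with the child via CLONE_VM */

/* Child: raw syscalls only, no stdio, no malloc, no cached getpid(). */
static int raw_child(void *arg)
{
    static const char msg[] = "I am func (raw write)\n";
    (void)arg;
    child_pid_seen = syscall(SYS_getpid);
    syscall(SYS_write, STDOUT_FILENO, msg, sizeof msg - 1);
    return 7;                          /* becomes the child's exit status */
}

/* Returns the child's exit status, or -1 on failure. */
int run_raw_child(void)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on x86 and most other Linux
     * architectures, so the top of the allocation is passed. */
    pid_t pid = clone(raw_child, stack + CHILD_STACK_SIZE,
                      CLONE_VM | SIGCHLD, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t reaped = waitpid(pid, &status, 0);
    free(stack);                       /* only after the child has exited */
    if (reaped != pid || !WIFEXITED(status))
        return -1;

    /* The child's store is visible here: same address space. */
    assert(child_pid_seen == pid);
    return WEXITSTATUS(status);
}
```

Waiting for the child before the parent returns from main also avoids the parent's exit-time flush of stdio running concurrently with the child.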

Answer 3

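As everyone suggests: it really does seem to be a problem of, how shall I put it in the case of clone(), process safety? With a rough sketch of a locking version of printf (using write(2)), the output is as expected.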

#define _GNU_SOURCE

#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>

#define STACK_SIZE 1024*1024

// VERY rough attempt at a thread-safe printf
#include <stdarg.h>
#define SYNC_REALLOC_GROW 64
int sync_printf(const char *format, ...)
{
  int n, all = 0;
  int size = 256;
  char *p, *np;
  va_list args;

  if ((p = malloc(size)) == NULL)
    return -1;

  for (;;) {
    va_start(args, format);
    n = vsnprintf(p, size, format, args);
    va_end(args);
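    if (n < 0) {
      free(p);  // do not leak the buffer on a formatting error
      return -1;
    }
    all = n;    // length of this formatting pass, not a running total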
    if (n < size)
      break;
    size = n + SYNC_REALLOC_GROW;
    if ((np = realloc(p, size)) == NULL) {
      free(p);
      return -1;
    } else {
      p = np;
    }
  }
  // write(2) should be thread-safe; the stdio lock is just in case
  flockfile(stdout);
  n = (int) write(fileno(stdout), p, all);
  fflush(stdout);
  funlockfile(stdout);
  free(p);
  return n;
}


int func(void *param)
{
  sync_printf("I am func, pid %d\n", getpid());
  return 0;
}

int main()
{

  sync_printf("I am main, pid %d\n", getpid());
  void *ptr = malloc(STACK_SIZE);

  sync_printf("I am calling clone\n");
  int res = clone(func, ptr + STACK_SIZE, CLONE_VM, NULL);
  // works fine with sleep() call
  // sleep(1);

  if (res == -1) {
    sync_printf("clone error: %d", errno);
  } else {
    sync_printf("I created child with pid: %d\n", res);
  }
  sync_printf("Main done, pid %d\n\n", getpid());
  return 0;
}

For the third time: this is only a sketch. I had no time for a robust version, but that should not stop you from writing one.
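One possible direction for a sturdier variant, offered only as a sketch: format into a fixed stack buffer and emit the result with a single write(2), so that neither malloc nor the stdio lock is involved. The name fd_printf and the FD_PRINTF_MAX cap are hypothetical, and vsnprintf is still a library call, so this reduces the shared state touched rather than eliminating it.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>

#define FD_PRINTF_MAX 512   /* hypothetical cap; longer output is truncated */

/* Format into a stack buffer, then emit it with one write(2) call.
 * Returns the number of bytes written, or -1 on error. */
int fd_printf(int fd, const char *format, ...)
{
    char buf[FD_PRINTF_MAX];
    va_list args;

    assert(format != NULL);
    va_start(args, format);
    int n = vsnprintf(buf, sizeof buf, format, args);
    va_end(args);
    if (n < 0)
        return -1;
    if ((size_t)n >= sizeof buf)
        n = (int)sizeof buf - 1;        /* the message was truncated */

    return (int)write(fd, buf, (size_t)n);
}
```

A call such as fd_printf(STDOUT_FILENO, "I am func, pid %d\n", (int)getpid()) would replace the corresponding sync_printf call.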

Answer 4

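You have a race condition (i.e.) you do not have the implied thread safety of stdio.

The problem is even more severe. You can get duplicate "func" messages.

The problem is that using clone does not give the same guarantees as pthread_create. (i.e.) You do not get the thread-safe variants of printf.

I am not sure, but IMO the verbiage about stdio streams and thread safety applies, in practice, only when using pthreads.

So, you will have to handle your own inter-thread locking.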
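As a rough illustration of what handling your own locking could look like, the sketch below uses a C11 atomic_flag as a spinlock shared by two CLONE_VM children. The children perform only pure computation and atomic operations, with no library calls. All names here are hypothetical, and SIGCHLD is added to the clone flags so that the parent can use a plain waitpid().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <sys/wait.h>

#define WORKER_STACK_SIZE (256 * 1024)
#define INCREMENTS 100000

static atomic_flag counter_lock = ATOMIC_FLAG_INIT;
static long shared_counter;             /* protected by counter_lock */

static void spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;                               /* busy-wait until the flag clears */
}

static void spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Child body: increments the shared counter under the lock. */
static int worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        spin_lock(&counter_lock);
        shared_counter++;
        spin_unlock(&counter_lock);
    }
    return 0;
}

/* Starts two children sharing the address space; returns the final count. */
long run_two_workers(void)
{
    char *stacks[2];
    pid_t pids[2];

    shared_counter = 0;
    for (int i = 0; i < 2; i++) {
        stacks[i] = malloc(WORKER_STACK_SIZE);
        assert(stacks[i] != NULL);
        pids[i] = clone(worker, stacks[i] + WORKER_STACK_SIZE,
                        CLONE_VM | SIGCHLD, NULL);
        assert(pids[i] != -1);
    }
    for (int i = 0; i < 2; i++) {
        waitpid(pids[i], NULL, 0);
        free(stacks[i]);                /* the child has exited by now */
    }
    return shared_counter;
}
```

A busy-waiting lock is used only because it needs no library support; it wastes CPU time under contention, which is one more reason to prefer pthreads.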

Here is a version of your program recoded to use pthread_create. It seems to work without incident:

#define _GNU_SOURCE

#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>

#define STACK_SIZE 1024*1024

void *func(void* param) {
    printf("I am func, pid %d\n", getpid());
    return (void *) 0;
}

int main(int argc, char const *argv[]) {
    printf("I am main, pid %d\n", getpid());
    void* ptr = malloc(STACK_SIZE);

    printf("I am calling clone\n");

    pthread_t tid;
    pthread_create(&tid,NULL,func,NULL);
    //int res = clone(func, ptr + STACK_SIZE, CLONE_VM, NULL);
    int res = 0;

    // works fine with sleep() call
    // sleep(1);

    if (res == -1) {
        printf("clone error: %d", errno);
    } else {
        printf("I created child with pid: %d\n", res);
    }

    pthread_join(tid,NULL);
    printf("Main done, pid %d\n", getpid());
    return 0;
}

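Here is a test script I have been using to check for errors [it is a little rough, but should be okay]. Run it against your version and it will abort quickly. The pthread_create version seems to pass just fine.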

#!/usr/bin/perl
# clonetest -- clone test
#
# arguments:
#   "-p0" -- suppress check for duplicate parent messages
#   "-c0" -- suppress check for duplicate child messages
#   1 -- base name for program to test (e.g. for xyz.c, use xyz)
#   2 -- [optional] number of test iterations (DEFAULT: 100000)

master(@ARGV);
exit(0);

# master -- master control
sub master
{
    my(@argv) = @_;
    my($arg,$sym);

    while (1) {
        $arg = $argv[0];
        last unless (defined($arg));

        last unless ($arg =~ s/^-(.)//);
        $sym = $1;

        shift(@argv);

        $arg = 1
            if ($arg eq "");

        $arg += 0;
        ${"opt_$sym"} = $arg;
    }

    $opt_p //= 1;
    $opt_c //= 1;
    printf("clonetest: p=%d c=%d\n",$opt_p,$opt_c);

    $xfile = shift(@argv);
    $xfile //= "clone1";
    printf("clonetest: xfile='%s'\n",$xfile);

    $itermax = shift(@argv);
    $itermax //= 100000;
    $itermax += 0;
    printf("clonetest: itermax=%d\n",$itermax);

    system("cc -o $xfile -O2 $xfile.c -lpthread");
    $code = $? >> 8;
    die("master: compile error\n")
        if ($code);

    $logf = "/tmp/log";

    for ($iter = 1;  $iter <= $itermax;  ++$iter) {
        printf("iter: %d\n",$iter)
            if ($opt_v);
        dotest($iter);
    }
}

# dotest -- perform single test
sub dotest
{
    my($iter) = @_;
    my($parcnt,$cldcnt);
    my($xfsrc,$bf);

    system("./$xfile > $logf");

    open($xfsrc,"<$logf") or
        die("dotest: unable to open '$logf' -- $!\n");

    while ($bf = <$xfsrc>) {
        chomp($bf);

        if ($opt_p) {
            while ($bf =~ /created/g) {
                ++$parcnt;
            }
        }

        if ($opt_c) {
            while ($bf =~ /func/g) {
                ++$cldcnt;
            }
        }
    }

    close($xfsrc);

    if (($parcnt > 1) or ($cldcnt > 1)) {
        printf("dotest: fail on %d -- parcnt=%d cldcnt=%d\n",
            $iter,$parcnt,$cldcnt);
        system("cat $logf");
        exit(1);
    }
}

UPDATE:

Were you able to recreate OPs problem with clone?

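Absolutely. Before I created the pthreads version, in addition to testing OP's original version, I also created versions that:

(1) added setlinebuf to the start of main

(2) added fflush just before the clone, and __fpurge as the first statement of func

(3) added an fflush in func before the return 0

Version (2) eliminated the duplicate parent messages, but the duplicate child messages remained.

If you would like to see this for yourself, download OP's version from the question, my version, and the test script. Then run the test script on OP's version.

I posted enough information and files so that anyone can recreate the problem.

Note that, due to differences between my system and OP's, I could not at first reproduce the problem in just 3-4 tries. That is why I created the script.

The script does 100,000 test runs, and the problem usually shows up within 5000 to 15000 of them.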

Answer 5

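As evaitl pointed out, printf is documented by glibc as thread-safe. BUT this typically assumes that you are using the designated glibc function for creating threads (that is, pthread_create()). If you do not, then you are on your own.

The lock taken by printf() is recursive (see flockfile). This means that if the lock is already taken, the implementation checks the owner of the lock against the locker. If the locker is the same as the owner, the locking attempt succeeds.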
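The recursive behavior can be observed directly with a small sketch: a caller that already holds the stream lock tries to take it again with ftrylockfile, which returns 0 on success. The function name relock_same_thread is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

/* Re-locks a stream that the caller already holds.
 * Returns the ftrylockfile() result: 0 means the nested attempt succeeded. */
int relock_same_thread(FILE *f)
{
    assert(f != NULL);
    flockfile(f);                  /* first acquisition: we own the lock */
    int rc = ftrylockfile(f);      /* second acquisition by the same owner */
    if (rc == 0)
        funlockfile(f);            /* undo the nested acquisition */
    funlockfile(f);                /* undo the first acquisition */
    return rc;
}
```

A task that the library cannot tell apart from the owner would pass the same check, which is the scenario this answer goes on to describe.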

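To distinguish between different threads, you need to set up TLS properly, which you do not do but pthread_create() does. My guess is that in your case the TLS variable that identifies the thread is the same for both threads, so both of them end up taking the lock.

TL;DR: please use pthread_create()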
