Detached pthread data race from helgrind
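I have a larger multi-threaded program (proprietary, so I cannot share it) for which helgrind reports a data race (shown below). Since I cannot share the software, I wrote some test programs that demonstrate the race.
The race reported for the actual software: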
==7746== Possible data race during write of size 1 at 0xAC83697 by thread #4
==7746== Locks held: 2, at addresses 0x583BCD8 0x5846F58
==7746== at 0x4C3A3CC: mempcpy (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==7746== by 0x401375F: _dl_allocate_tls_init (dl-tls.c:515)
==7746== by 0x5053CED: get_cached_stack (allocatestack.c:254)
==7746== by 0x5053CED: allocate_stack (allocatestack.c:501)
==7746== by 0x5053CED: pthread_create@@GLIBC_2.2.5 (pthread_create.c:539)
==7746== by 0x4C34BB7: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==7746== by 0x40BFA6: <redacted symbol names from private project>
==7746== by 0x4C34DB6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==7746== by 0x50536B9: start_thread (pthread_create.c:333)
==7746==
==7746== This conflicts with a previous write of size 1 by thread #10
==7746== Locks held: none
==7746== at 0x5053622: start_thread (pthread_create.c:265)
==7746== Address 0xac83697 is in a rw- anonymous segment
==7746==
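This data race shows up when the software shuts down a set of threads and then re-launches some new threads in the same thread pool. Unfortunately I cannot provide any of that code, but I believe I have been able to produce several examples that demonstrate the problem.
I have found 3 other questions related to this one:
One answer suggests manually setting/allocating the stack. I don't think that is a viable answer; if it is, can someone explain why?
The answer to another had no effect.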
- Data race with detached pthread detected by valgrind
This one has no answer.
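EDIT: I have added another (less complicated) example at the bottom of this post that also reproduces the problem.
I was able to rewrite the example given in the first of those questions into a minimal reproducible example, well, mostly.
The following code generates the data race shown below roughly 85% of the time it is run on my machine (Ubuntu 16.04.6 LTS).
Run with: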
gcc -g ./test.c -o test -lpthread && valgrind --tool=helgrind ./test
==15656== Possible data race during write of size 1 at 0x5C27697 by thread #4
==15656== Locks held: none
==15656== at 0x4C3A3CC: mempcpy (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==15656== by 0x401375F: _dl_allocate_tls_init (dl-tls.c:515)
==15656== by 0x4E47CED: get_cached_stack (allocatestack.c:254)
==15656== by 0x4E47CED: allocate_stack (allocatestack.c:501)
==15656== by 0x4E47CED: pthread_create@@GLIBC_2.2.5 (pthread_create.c:539)
==15656== by 0x4C34BB7: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==15656== by 0x400832: launch (test3.c:22)
==15656== by 0x4008FC: threadfn3 (test3.c:48)
==15656== by 0x4C34DB6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==15656== by 0x4E476B9: start_thread (pthread_create.c:333)
==15656==
==15656== This conflicts with a previous write of size 1 by thread #2
==15656== Locks held: none
==15656== at 0x4E47622: start_thread (pthread_create.c:265)
==15656== Address 0x5c27697 is in a rw- anonymous segment
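Here is the program I built to reproduce the problem. The semaphores are not required, but they seem to greatly increase the chance of the data race occurring.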
#include <semaphore.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

pthread_t t1;
pthread_t t2;
pthread_t t3;
pthread_t t4;

void *threadfn1(void *p);
void *threadfn2(void *p);
void *threadfn3(void *p);
void *threadfn4(void *p);

sem_t sem;
sem_t sem2;
sem_t sem3;

void launch(pthread_t *t, void *(*fn)(void *), void *arg)
{
    pthread_create(t, NULL, fn, arg);
    pthread_detach(*t);
}

void *threadfn1(void *p)
{
    launch(&t2, threadfn2, NULL);
    printf("1 %p\n", p);
    // notify threadfn3 we are done
    sem_post(&sem);
    return NULL;
}

void *threadfn2(void *p)
{
    launch(&t3, threadfn3, NULL);
    printf("2 %p\n", p);
    // notify threadfn4 we are done
    sem_post(&sem2);
    return NULL;
}

void *threadfn3(void *p)
{
    // wait for threadfn1 to finish
    sem_wait(&sem);
    launch(&t4, threadfn4, NULL);
    // wait for threadfn4 to finish
    sem_wait(&sem3);
    printf("3 %p\n", p);
    return NULL;
}

void *threadfn4(void *p)
{
    // wait for threadfn2 to finish
    sem_wait(&sem2);
    printf("4 %p\n", p);
    // notify threadfn3 we are done
    sem_post(&sem3);
    return NULL;
}

int main()
{
    sem_init(&sem, 0, 0);
    sem_init(&sem2, 0, 0);
    sem_init(&sem3, 0, 0);
    launch(&t1, threadfn1, NULL);
    printf("main\n");
    pthread_exit(NULL);
}
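It seems to have something to do with threads finishing before their parents or parents-of-parents finish... Ultimately I could not pin down exactly what causes the data race.
It should also be noted that a second data race showed up a few times during my testing. I could not reproduce it reliably, as it only appeared occasionally and for no apparent reason. It is the same as the one listed above, except that the conflicting access shows a longer stack trace than just "start_thread". It looks exactly like the data race reported in the first of the questions mentioned above, except that at the bottom it lists __libc_thread_freeres: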
==15973== Possible data race during write of size 1 at 0x5C27697 by thread #4
==15973== Locks held: none
==15973== at 0x4C3A3CC: mempcpy (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==15973== by 0x401375F: _dl_allocate_tls_init (dl-tls.c:515)
==15973== by 0x4E47CED: get_cached_stack (allocatestack.c:254)
==15973== by 0x4E47CED: allocate_stack (allocatestack.c:501)
==15973== by 0x4E47CED: pthread_create@@GLIBC_2.2.5 (pthread_create.c:539)
==15973== by 0x4C34BB7: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==15973== by 0x400832: launch (test3.c:22)
==15973== by 0x4008FC: threadfn3 (test3.c:48)
==15973== by 0x4C34DB6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==15973== by 0x4E476B9: start_thread (pthread_create.c:333)
==15973==
==15973== This conflicts with a previous read of size 1 by thread #2
==15973== Locks held: none
==15973== at 0x51C10B1: res_thread_freeres (in /lib/x86_64-linux-gnu/libc-2.19.so)
==15973== by 0x51C1061: __libc_thread_freeres (in /lib/x86_64-linux-gnu/libc-2.19.so)
==15973== by 0x4E45199: start_thread (pthread_create.c:329)
==15973== by 0x515547C: clone (clone.S:111)
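No, I cannot join the threads; that would not work for the software in which the problem occurs.
UPDATE: I have been doing some more testing and managed to produce another example that triggers the problem with much less code. Simply launching threads and detaching them in a loop causes the data race.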
#include <pthread.h>
#include <stdio.h>

// seems we only need 3 threads to cause the problem
#define NUM_THREADS 3

pthread_t t1[NUM_THREADS] = {0};

void launch(pthread_t *t, void *(*fn)(void *), void *arg)
{
    pthread_create(t, NULL, fn, arg);
    pthread_detach(*t);
}

void *threadfn(void *p)
{
    return NULL;
}

int main()
{
    int i = NUM_THREADS;
    while (i-- > 0) {
        launch(t1 + i, threadfn, NULL);
    }
    return 0;
}
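UPDATE 2: I found that launching all of the threads BEFORE detaching any of them seems to prevent the race condition from manifesting. See the following code block, which does NOT produce the race condition: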
#include <pthread.h>

#define NUM_THREADS 3

pthread_t t1[NUM_THREADS] = {0};

void launch(pthread_t *t, void *(*fn)(void *), void *arg)
{
    pthread_create(t, NULL, fn, arg);
}

void *threadfn(void *p)
{
    return NULL;
}

int main()
{
    int i;
    for (i = 0; i < NUM_THREADS; ++i) {
        launch(t1 + i, threadfn, NULL);
    }
    for (i = 0; i < NUM_THREADS; ++i) {
        pthread_detach(t1[i]);
    }
    pthread_exit(NULL);
}
If I add another pthread_create() call after any pthread_detach() call, the race condition re-appears. This makes me think it is impossible to use pthread_detach() and subsequently pthread_create() without causing this data race.
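In the end I just restructured everything so that I could join my threads. I really don't understand how detached threads can work without causing this data race.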