在两个相同的 Skylake Xeon Gold 6154 系统上测得的不同内核间延迟
Different inter-core latency measured on two identical Skylake Xeon Gold 6154 systems
我们一直在使用两台相同的 Skylake 服务器,软件完全相同,Centos 7 OS 和 BIOS 设置。一切都一样,除了延迟性能。我们的软件使用的是 AVX512。
在测试中,我注意到 AVX512 每次都会降低其中一个系统的性能(增加延迟)。存在显着的性能差异。我检查了一切,都是一样的。
我应该怎么做才能解决这个问题?哪个工具可以提供帮助?
提前致谢..
sudo lshw -class cpu
*-cpu:0
description: CPU
product: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
vendor: Intel Corp.
vendor_id: GenuineIntel
physical id: 400
bus info: cpu@0
version: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
slot: CPU1
size: 3GHz
capacity: 4GHz
width: 64 bits
clock: 1010MHz
capabilities: lm fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp x86-64 constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d
configuration: cores=18 enabledcores=18 threads=18
*-cpu:1 DISABLED
description: CPU [empty]
physical id: 401
slot: CPU2
更新: 在 Peter 的评论之后,我添加了以下示例代码作为示例。
#include <emmintrin.h>
#include <pthread.h>
#include <immintrin.h>
#include <unistd.h>
#include <inttypes.h>
#include <string.h>
#include <stdbool.h>
#include <stdio.h>
#define CACHE_LINE_SIZE 64
/**
* Copy 64 bytes from one location to another,
* locations should not overlap.
*/
static inline __attribute__((always_inline)) void
mov64(uint8_t *dst, const uint8_t *src)
{
__m512i zmm0;
zmm0 = _mm512_load_si512((const void *)src);
_mm512_store_si512((void *)dst, zmm0);
}
#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)
static inline uint64_t rdtsc(void)
{
union {
uint64_t tsc_64;
__extension__
struct {
uint32_t lo_32;
uint32_t hi_32;
};
} tsc;
__asm__ volatile("rdtsc" :
"=a" (tsc.lo_32),
"=d" (tsc.hi_32));
return tsc.tsc_64;
}
union levels {
__m512i zmm0;
struct {
uint32_t x1;
uint64_t x2;
uint64_t x3;
uint32_t x4;
uint32_t x5;
uint32_t x6;
uint32_t x7;
};
} __attribute__((aligned(CACHE_LINE_SIZE)));
union levels g_shared;
void *worker_loop(void *param)
{
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(16, &cpuset);
pthread_t thread = pthread_self();
pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
union levels lshared;
uint32_t old_x1 = 0;
lshared.x1 = 0;
while (1) {
__asm__ ("" ::: "memory");
lshared.zmm0 = _mm512_load_si512((const void *)&g_shared);
if (unlikely(lshared.x1 <= old_x1)) {
continue;
} else if (unlikely(lshared.x1 != lshared.x7)) {
// printf("%u %u %u %u %u %u\n", lshared.x1, lshared.x3, lshared.x4, lshared.x5, lshared.x6, lshared.x7);
exit(EXIT_FAILURE);
} else {
uint64_t val = rdtsc();
if (val > lshared.x2) {
printf("> (%u) %lu - %lu = %lu\n", lshared.x1, val, lshared.x2, val - lshared.x2);
} else {
printf("< (%u) %lu - %lu = %lu\n", lshared.x1, lshared.x2, val, lshared.x2 - val);
}
}
old_x1 = lshared.x1;
_mm_pause();
}
return NULL;
}
int main(int argc, char *argv[])
{
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(15, &cpuset);
pthread_t thread = pthread_self();
memset(&g_shared, 0, sizeof(g_shared));
pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
pthread_t worker;
pthread_create(&worker, NULL, worker_loop, NULL);
uint32_t val = 1;
union levels lshared;
while (1) {
lshared.x1 = val;
lshared.x2 = rdtsc();
lshared.x3 = val;
lshared.x4 = val;
lshared.x5 = val;
lshared.x6 = val;
lshared.x7 = val;
_mm512_store_si512((void *)&g_shared, lshared.zmm0);
__asm__ ("" ::: "memory");
usleep(100000);
val++;
_mm_pause();
}
return EXIT_SUCCESS;
}
较慢的系统输出:
> (1) 4582365777844442 - 4582365777792564 = 51878
> (2) 4582366077239290 - 4582366077238806 = 484
> (3) 4582366376674782 - 4582366376674346 = 436
> (4) 4582366676044526 - 4582366676041890 = 2636
> (5) 4582366975470562 - 4582366975470134 = 428
> (6) 4582367274899258 - 4582367274898828 = 430
> (7) 4582367574328446 - 4582367574328022 = 424
> (8) 4582367873757956 - 4582367873757532 = 424
> (9) 4582368173187886 - 4582368173187466 = 420
> (10) 4582368472618418 - 4582368472617958 = 460
> (11) 4582368772049720 - 4582368772049236 = 484
> (12) 4582369071481018 - 4582369071480594 = 424
> (13) 4582369370912760 - 4582369370912284 = 476
> (14) 4582369670344890 - 4582369670344212 = 678
> (15) 4582369969776826 - 4582369969776400 = 426
> (16) 4582370269209462 - 4582370269209024 = 438
> (17) 4582370568642626 - 4582370568642172 = 454
> (18) 4582370868076202 - 4582370868075764 = 438
> (19) 4582371167510016 - 4582371167509594 = 422
> (20) 4582371466944326 - 4582371466943892 = 434
> (21) 4582371766379206 - 4582371766378734 = 472
> (22) 4582372065814804 - 4582372065814344 = 460
> (23) 4582372365225608 - 4582372365223068 = 2540
> (24) 4582372664652112 - 4582372664651668 = 444
> (25) 4582372964080746 - 4582372964080314 = 432
> (26) 4582373263510732 - 4582373263510308 = 424
> (27) 4582373562940116 - 4582373562939676 = 440
> (28) 4582373862370284 - 4582373862369860 = 424
> (29) 4582374161800632 - 4582374161800182 = 450
更快的系统输出:
> (1) 9222001841102298 - 9222001841045386 = 56912
> (2) 9222002140513228 - 9222002140512908 = 320
> (3) 9222002439970702 - 9222002439970330 = 372
> (4) 9222002739428448 - 9222002739428114 = 334
> (5) 9222003038886492 - 9222003038886152 = 340
> (6) 9222003338344884 - 9222003338344516 = 368
> (7) 9222003637803702 - 9222003637803332 = 370
> (8) 9222003937262776 - 9222003937262404 = 372
> (9) 9222004236649320 - 9222004236648932 = 388
> (10) 9222004536101876 - 9222004536101510 = 366
> (11) 9222004835554776 - 9222004835554378 = 398
> (12) 9222005135008064 - 9222005135007686 = 378
> (13) 9222005434461868 - 9222005434461526 = 342
> (14) 9222005733916416 - 9222005733916026 = 390
> (15) 9222006033370968 - 9222006033370640 = 328
> (16) 9222006332825872 - 9222006332825484 = 388
> (17) 9222006632280956 - 9222006632280570 = 386
> (18) 9222006931736548 - 9222006931736178 = 370
> (19) 9222007231192376 - 9222007231191986 = 390
> (20) 9222007530648868 - 9222007530648486 = 382
> (21) 9222007830105642 - 9222007830105270 = 372
> (22) 9222008129562750 - 9222008129562382 = 368
> (23) 9222008429020310 - 9222008429019944 = 366
> (24) 9222008728478336 - 9222008728477970 = 366
> (25) 9222009027936696 - 9222009027936298 = 398
> (26) 9222009327395716 - 9222009327395342 = 374
> (27) 9222009626854876 - 9222009626854506 = 370
> (28) 9222009926282324 - 9222009926281936 = 388
> (29) 9222010225734832 - 9222010225734442 = 390
> (30) 9222010525187748 - 9222010525187366 = 382
更新2:在Peter的回答后,我添加了下面的示例代码作为示例来测量同一芯片上不同网状网络路径的延迟,答案的内容是正确的,不同的 cpu 具有不同的 cpu 间延迟。但在所有情况下,相同系统中的一个仍然比另一个慢 25%。
我也不知道会不会影响,不过我才发现慢的CPU有额外的md_clear标志。
总之,我应该怎么做才能解决这个问题?哪个工具可以提供帮助?我如何理解性能差异?
#include <emmintrin.h>
#include <pthread.h>
#include <immintrin.h>
#include <unistd.h>
#include <inttypes.h>
#include <string.h>
#include <stdbool.h>
#include <stdio.h>
#define CACHE_LINE_SIZE 64
/**
* Copy 64 bytes from one location to another,
* locations should not overlap.
*/
static inline __attribute__((always_inline)) void
mov64(uint8_t *dst, const uint8_t *src)
{
__m512i zmm0;
zmm0 = _mm512_load_si512((const void *)src);
_mm512_store_si512((void *)dst, zmm0);
}
#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)
static inline uint64_t rdtsc(void)
{
union {
uint64_t tsc_64;
__extension__
struct {
uint32_t lo_32;
uint32_t hi_32;
};
} tsc;
__asm__ volatile("rdtsc" :
"=a" (tsc.lo_32),
"=d" (tsc.hi_32));
return tsc.tsc_64;
}
union levels {
__m512i zmm0;
struct {
uint32_t x1;
uint64_t x2;
uint64_t x3;
uint32_t x4;
uint32_t x5;
uint32_t x6;
uint32_t x7;
};
} __attribute__((aligned(CACHE_LINE_SIZE)));
union levels g_shared;
uint32_t g_main_cpu;
uint32_t g_worker_cpu;
void *worker_loop(void *param)
{
_mm_mfence();
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(g_worker_cpu, &cpuset);
pthread_t thread = pthread_self();
pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
union levels lshared;
uint32_t old_x1 = 1;
uint64_t min = 10000, max = 0, sum = 0;
int i = 0;
while (i < 300) {
__asm__ ("" ::: "memory");
lshared.zmm0 = _mm512_load_si512((const void *)&g_shared);
if (unlikely(lshared.x1 <= old_x1)) {
continue;
} else if (unlikely(lshared.x1 != lshared.x7)) {
exit(EXIT_FAILURE);
} else {
uint64_t val = rdtsc();
uint64_t diff = val - lshared.x2;
sum += diff;
if (min > diff)
min = diff;
if (diff > max)
max = diff;
i++;
}
old_x1 = lshared.x1;
_mm_pause();
}
printf("(M=%u-W=%u) min=%lu max=%lu mean=%lu\n", g_main_cpu, g_worker_cpu, min, max, sum / 300);
return NULL;
}
int main(int argc, char *argv[])
{
for (int main_cpu = 2; main_cpu <= 17; ++main_cpu) {
for (int worker_cpu = 2; worker_cpu <= 17; ++worker_cpu) {
if (main_cpu == worker_cpu) {
continue;
}
_mm_mfence();
g_main_cpu = main_cpu;
g_worker_cpu = worker_cpu;
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(g_main_cpu, &cpuset);
pthread_t thread = pthread_self();
memset(&g_shared, 0, sizeof(g_shared));
pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
pthread_t worker;
pthread_create(&worker, NULL, worker_loop, NULL);
uint32_t val = 0;
union levels lshared;
for (int i = 0; i < 350; ++i) {
lshared.x1 = val;
lshared.x2 = rdtsc();
lshared.x3 = val;
lshared.x4 = val;
lshared.x5 = val;
lshared.x6 = val;
lshared.x7 = val;
_mm512_store_si512((void *)&g_shared, lshared.zmm0);
__asm__ ("" ::: "memory");
usleep(100000);
val++;
_mm_pause();
}
pthread_join(worker, NULL);
}
}
return EXIT_SUCCESS;
}
两个系统的输出:(2-17被隔离cpus)
slow cpu fast cpu
------------------------------------
(M=2-W=3) mean=580 mean=374
(M=2-W=4) mean=463 mean=365
(M=2-W=5) mean=449 mean=391
(M=2-W=6) mean=484 mean=345
(M=2-W=7) mean=430 mean=386
(M=2-W=8) mean=439 mean=369
(M=2-W=9) mean=445 mean=376
(M=2-W=10) mean=480 mean=354
(M=2-W=11) mean=440 mean=392
(M=2-W=12) mean=475 mean=324
(M=2-W=13) mean=453 mean=373
(M=2-W=14) mean=474 mean=344
(M=2-W=15) mean=445 mean=384
(M=2-W=16) mean=468 mean=372
(M=2-W=17) mean=462 mean=373
(M=3-W=2) mean=447 mean=392
(M=3-W=4) mean=556 mean=386
(M=3-W=5) mean=418 mean=409
(M=3-W=6) mean=473 mean=372
(M=3-W=7) mean=397 mean=400
(M=3-W=8) mean=408 mean=403
(M=3-W=9) mean=412 mean=413
(M=3-W=10) mean=447 mean=389
(M=3-W=11) mean=412 mean=423
(M=3-W=12) mean=446 mean=399
(M=3-W=13) mean=427 mean=407
(M=3-W=14) mean=445 mean=390
(M=3-W=15) mean=417 mean=448
(M=3-W=16) mean=438 mean=386
(M=3-W=17) mean=435 mean=396
(M=4-W=2) mean=463 mean=368
(M=4-W=3) mean=433 mean=401
(M=4-W=5) mean=561 mean=406
(M=4-W=6) mean=468 mean=378
(M=4-W=7) mean=416 mean=387
(M=4-W=8) mean=425 mean=386
(M=4-W=9) mean=425 mean=415
(M=4-W=10) mean=464 mean=379
(M=4-W=11) mean=424 mean=404
(M=4-W=12) mean=456 mean=369
(M=4-W=13) mean=441 mean=395
(M=4-W=14) mean=460 mean=378
(M=4-W=15) mean=427 mean=405
(M=4-W=16) mean=446 mean=369
(M=4-W=17) mean=448 mean=391
(M=5-W=2) mean=447 mean=382
(M=5-W=3) mean=418 mean=406
(M=5-W=4) mean=430 mean=397
(M=5-W=6) mean=584 mean=386
(M=5-W=7) mean=399 mean=399
(M=5-W=8) mean=404 mean=386
(M=5-W=9) mean=408 mean=408
(M=5-W=10) mean=446 mean=378
(M=5-W=11) mean=411 mean=407
(M=5-W=12) mean=440 mean=385
(M=5-W=13) mean=424 mean=402
(M=5-W=14) mean=442 mean=381
(M=5-W=15) mean=411 mean=411
(M=5-W=16) mean=433 mean=398
(M=5-W=17) mean=429 mean=395
(M=6-W=2) mean=486 mean=356
(M=6-W=3) mean=453 mean=388
(M=6-W=4) mean=471 mean=353
(M=6-W=5) mean=452 mean=388
(M=6-W=7) mean=570 mean=360
(M=6-W=8) mean=444 mean=377
(M=6-W=9) mean=450 mean=376
(M=6-W=10) mean=485 mean=335
(M=6-W=11) mean=451 mean=410
(M=6-W=12) mean=479 mean=353
(M=6-W=13) mean=463 mean=363
(M=6-W=14) mean=479 mean=359
(M=6-W=15) mean=450 mean=394
(M=6-W=16) mean=473 mean=364
(M=6-W=17) mean=469 mean=373
(M=7-W=2) mean=454 mean=365
(M=7-W=3) mean=418 mean=410
(M=7-W=4) mean=443 mean=370
(M=7-W=5) mean=421 mean=407
(M=7-W=6) mean=456 mean=363
(M=7-W=8) mean=527 mean=380
(M=7-W=9) mean=417 mean=392
(M=7-W=10) mean=460 mean=361
(M=7-W=11) mean=421 mean=402
(M=7-W=12) mean=447 mean=354
(M=7-W=13) mean=430 mean=381
(M=7-W=14) mean=449 mean=375
(M=7-W=15) mean=420 mean=393
(M=7-W=16) mean=442 mean=352
(M=7-W=17) mean=438 mean=367
(M=8-W=2) mean=463 mean=382
(M=8-W=3) mean=434 mean=411
(M=8-W=4) mean=452 mean=372
(M=8-W=5) mean=429 mean=402
(M=8-W=6) mean=469 mean=368
(M=8-W=7) mean=416 mean=418
(M=8-W=9) mean=560 mean=418
(M=8-W=10) mean=468 mean=385
(M=8-W=11) mean=429 mean=394
(M=8-W=12) mean=460 mean=378
(M=8-W=13) mean=439 mean=392
(M=8-W=14) mean=459 mean=373
(M=8-W=15) mean=429 mean=383
(M=8-W=16) mean=452 mean=376
(M=8-W=17) mean=449 mean=401
(M=9-W=2) mean=440 mean=368
(M=9-W=3) mean=410 mean=398
(M=9-W=4) mean=426 mean=385
(M=9-W=5) mean=406 mean=403
(M=9-W=6) mean=447 mean=378
(M=9-W=7) mean=393 mean=427
(M=9-W=8) mean=408 mean=368
(M=9-W=10) mean=580 mean=392
(M=9-W=11) mean=408 mean=387
(M=9-W=12) mean=433 mean=381
(M=9-W=13) mean=418 mean=444
(M=9-W=14) mean=441 mean=407
(M=9-W=15) mean=408 mean=401
(M=9-W=16) mean=427 mean=376
(M=9-W=17) mean=426 mean=383
(M=10-W=2) mean=478 mean=361
(M=10-W=3) mean=446 mean=379
(M=10-W=4) mean=461 mean=350
(M=10-W=5) mean=445 mean=373
(M=10-W=6) mean=483 mean=354
(M=10-W=7) mean=428 mean=370
(M=10-W=8) mean=436 mean=355
(M=10-W=9) mean=448 mean=390
(M=10-W=11) mean=569 mean=350
(M=10-W=12) mean=473 mean=337
(M=10-W=13) mean=454 mean=370
(M=10-W=14) mean=474 mean=360
(M=10-W=15) mean=441 mean=370
(M=10-W=16) mean=463 mean=354
(M=10-W=17) mean=462 mean=358
(M=11-W=2) mean=447 mean=384
(M=11-W=3) mean=411 mean=408
(M=11-W=4) mean=433 mean=394
(M=11-W=5) mean=413 mean=428
(M=11-W=6) mean=455 mean=383
(M=11-W=7) mean=402 mean=395
(M=11-W=8) mean=407 mean=418
(M=11-W=9) mean=417 mean=424
(M=11-W=10) mean=452 mean=395
(M=11-W=12) mean=577 mean=406
(M=11-W=13) mean=426 mean=402
(M=11-W=14) mean=442 mean=412
(M=11-W=15) mean=408 mean=411
(M=11-W=16) mean=435 mean=400
(M=11-W=17) mean=431 mean=415
(M=12-W=2) mean=473 mean=352
(M=12-W=3) mean=447 mean=381
(M=12-W=4) mean=461 mean=361
(M=12-W=5) mean=445 mean=366
(M=12-W=6) mean=483 mean=322
(M=12-W=7) mean=431 mean=358
(M=12-W=8) mean=438 mean=340
(M=12-W=9) mean=448 mean=409
(M=12-W=10) mean=481 mean=334
(M=12-W=11) mean=447 mean=351
(M=12-W=13) mean=580 mean=383
(M=12-W=14) mean=473 mean=359
(M=12-W=15) mean=441 mean=385
(M=12-W=16) mean=463 mean=355
(M=12-W=17) mean=462 mean=358
(M=13-W=2) mean=450 mean=385
(M=13-W=3) mean=420 mean=410
(M=13-W=4) mean=440 mean=396
(M=13-W=5) mean=418 mean=402
(M=13-W=6) mean=461 mean=385
(M=13-W=7) mean=406 mean=391
(M=13-W=8) mean=415 mean=382
(M=13-W=9) mean=421 mean=402
(M=13-W=10) mean=457 mean=376
(M=13-W=11) mean=422 mean=409
(M=13-W=12) mean=451 mean=381
(M=13-W=14) mean=579 mean=375
(M=13-W=15) mean=430 mean=402
(M=13-W=16) mean=440 mean=408
(M=13-W=17) mean=439 mean=394
(M=14-W=2) mean=477 mean=330
(M=14-W=3) mean=449 mean=406
(M=14-W=4) mean=464 mean=355
(M=14-W=5) mean=450 mean=389
(M=14-W=6) mean=487 mean=342
(M=14-W=7) mean=432 mean=380
(M=14-W=8) mean=439 mean=360
(M=14-W=9) mean=451 mean=405
(M=14-W=10) mean=485 mean=356
(M=14-W=11) mean=447 mean=398
(M=14-W=12) mean=479 mean=338
(M=14-W=13) mean=455 mean=382
(M=14-W=15) mean=564 mean=383
(M=14-W=16) mean=481 mean=361
(M=14-W=17) mean=465 mean=351
(M=15-W=2) mean=426 mean=409
(M=15-W=3) mean=395 mean=424
(M=15-W=4) mean=412 mean=427
(M=15-W=5) mean=395 mean=425
(M=15-W=6) mean=435 mean=391
(M=15-W=7) mean=379 mean=405
(M=15-W=8) mean=388 mean=412
(M=15-W=9) mean=399 mean=432
(M=15-W=10) mean=432 mean=389
(M=15-W=11) mean=397 mean=432
(M=15-W=12) mean=426 mean=393
(M=15-W=13) mean=404 mean=407
(M=15-W=14) mean=429 mean=412
(M=15-W=16) mean=539 mean=391
(M=15-W=17) mean=414 mean=397
(M=16-W=2) mean=456 mean=368
(M=16-W=3) mean=422 mean=406
(M=16-W=4) mean=445 mean=384
(M=16-W=5) mean=427 mean=397
(M=16-W=6) mean=462 mean=348
(M=16-W=7) mean=413 mean=408
(M=16-W=8) mean=419 mean=361
(M=16-W=9) mean=429 mean=385
(M=16-W=10) mean=463 mean=369
(M=16-W=11) mean=426 mean=404
(M=16-W=12) mean=454 mean=391
(M=16-W=13) mean=434 mean=378
(M=16-W=14) mean=454 mean=412
(M=16-W=15) mean=424 mean=416
(M=16-W=17) mean=578 mean=378
(M=17-W=2) mean=460 mean=402
(M=17-W=3) mean=419 mean=381
(M=17-W=4) mean=446 mean=394
(M=17-W=5) mean=424 mean=422
(M=17-W=6) mean=468 mean=369
(M=17-W=7) mean=409 mean=401
(M=17-W=8) mean=418 mean=405
(M=17-W=9) mean=428 mean=414
(M=17-W=10) mean=459 mean=369
(M=17-W=11) mean=424 mean=387
(M=17-W=12) mean=451 mean=372
(M=17-W=13) mean=435 mean=382
(M=17-W=14) mean=459 mean=369
(M=17-W=15) mean=426 mean=401
(M=17-W=16) mean=446 mean=371
我的猜测:不同的Xeon Gold 6154芯片(18c 36t)有不同的核心因缺陷而熔断,所以你有您固定到 and/or 您的缓存行最终映射到的 L3 缓存片的两个内核之间的 不同的网状网络路径 。这会影响这两个内核之间的内核间延迟。
根据 Wikichip,它 based on the "Extreme Core Count die" for SKX, which has 28 physical cores on it, the core count of the Xeon Platinum 8176 基于相同的芯片。
所以您的芯片上禁用了 10 个内核,但可能是不同的 10 个。这可能意味着某些内核彼此之间的跳数更多(也许)? And/or 这可能意味着内核以不同的顺序枚举,因此相同的硬编码内核编号意味着不同的网格位置。
https://en.wikichip.org/wiki/intel/mesh_interconnect_architecture
您的更新显示了来自所有核心对的新数据。似乎一个 CPU 对于大多数但不是所有的核心对来说都比较慢。 (尽管如果您使用平均值而不丢弃异常值,我并不完全相信该数据。)这仍然可以用不同的网格布局合理地解释,可能大多数核心之间的距离明显更差。
它是一个二维网格,大概反映了核心的物理布局。也许快速 CPU 大部分都在外部禁用了核心,因此活跃的核心相当密集地打包到一个较小的网格中。但也许较慢的那个在网格中有更多 "interior" 个核心的缺陷。
I just realized that the slow CPU has extra md_clear
CPU feature flag.
根据 https://software.intel.com/security-software-guidance/insights/deep-dive-intel-analysis-microarchitectural-data-sampling,md_clear
标志表示微代码支持通过 verw
指令等进行 L1TF/微架构数据采样的变通方法
也许较新的微代码版本还有另一个更改会损害此微基准测试的性能(也可能是整体性能)。或者这可能是巧合。
来自更多 Xeon Gold CPUs 的更多数据以及较旧和较新的微码可能会有所启发。如果我们仍然看到 CPU 之间存在如此大的差异,即使使用相同的微代码,那将支持我的假设,即物理核心被融合为 28 核裸片作为 18 工作核心出售的结果 CPU.
还在基于较小裸片的 Xeon 上进行测试,例如启用所有 14 个内核的 14 核 HCC 裸片,可能会显示更好的最坏情况对内核间延迟。可能需要控制不同的 RDTSC vs turbo vs uncore 频率,除非网格时钟与参考核心时钟成比例。
该解释完全不依赖于 AVX512。您是否看到标量负载具有相同的效果?
此外,如果没有_mm_pause
,可能一个小的时间差异恰好对一个比另一个产生更糟糕的影响;也许一个核心正在看到管道核弹(machine_clears.memory_ordering
性能事件)而另一个不是?
您对 _mm_pause()
的更新主要排除了放大真实延迟的微小差异。不管是什么原因,差异确实有那么大。
您的 CPU 足够新,可以安全地假设 TSC 在核心之间同步,并且大概两者都已经 运行 达到最大涡轮增压。 (其中一个名为 CPU 的特性,constant_tsc
或 invariant_tsc
明确保证了这一点,但我忘了是哪一个。另一个意味着无论核心时钟频率如何,它都以固定的参考频率滴答.nonstop_tsc
表示内核休眠时不停止。)
(TL:DR:我认为您的微基准测试看起来很正常,并且您正在以合理的方式测量内核间延迟,没有巨大的测量误差。)
What should I do to solve this problem?
你不能。
如果低内核间延迟对于一个应用程序至关重要,请尝试几个不同的 CPUs,直到找到延迟低于平均水平的应用程序。
运行 Xeons 上延迟更差的其他应用程序。
或者,如果我的假设是正确的,也许可以得到一个基于高核心数芯片的 14 核 Xeon Gold。启用所有 14 个内核后,这应该是最好的情况。但是那些 Xeons 只有 1 个 AVX512 FMA 单元。
Which tool can help?
如果只有几个线程需要紧密耦合,请在您拥有的 CPU 上找到彼此延迟最低的物理内核集群。将对延迟最敏感的线程固定到这些内核。
如果这适用于您的应用程序,可以考虑基于 4 个物理内核的 CCX 单元的 Zen 或 Zen2 微体系结构,该集群内部具有低延迟,但跨集群的延迟要差得多. AMD 确实有一些多核芯片,但只有 Zen2 在其 load/store 和执行单元中具有完整的 256 位 SIMD 宽度。 (它仍然不支持 AVX512,但如果您的应用程序可以大量使用 SIMD,那么至少全速 AVX2+FMA 可能就是您想要的)。
How can I understand the performance difference?
如果我的假设是正确的,它是制造和销售的 CPU 的固有 属性。英特尔设计了具有 n
个物理内核的裸片。如果制造缺陷损坏了其中一些核心,他们仍然可以将其作为核心数较少的 SKU 出售。 (他们烧掉了物理保险丝,因此禁用的核心不会浪费电力)。大概它的网格节点仍然必须工作,除非他们可以短路整个节点以收紧网格?
当最高核心数 SKU 的产量高于他们想要销售的价格点的需求时,他们将禁用芯片上的一些工作核心和有缺陷的核心。但这通常是物理上的激光保险丝,而不是像旧 GPU 那样的固件,有时您可以破解固件以激活禁用的内核。所以你实际上无能为力。
购买芯片上所有核心都已启用的芯片(例如,"Extreme" 核心数 Xeons 有 28 个核心)意味着没有熔断核心。就内核间延迟的最坏情况对而言,这可能会为我们提供一些有趣的测试数据。
启用所有内核的较低内核数裸片也可能很有趣。https://en.wikichip.org/wiki/Category:microprocessor_models_by_intel_based_on_skylake_high_core_count_die page shows the "high" core count (HCC) SKX die has 14 cores (half the ECC die). The top model using that die is Xeon Gold 5120, a 14c/28t model. (With 1x 512-bit FMA unit per core, not 2). Intel Ark confirms。
如果 HCC 裸片每个内核只有 1 个 FMA 单元,我不会感到惊讶,这与 ECC 裸片包括额外的端口 5 512 位 FMA 单元不同。这将为英特尔销售的所有中档 SKU 节省芯片面积,并且拥有第二个 FMA 单元仅有助于 AVX512 代码。许多代码没有使用 AVX512。 (AVX2 和 AVX512 256 位 FMA 吞吐量在 CPUs 上的端口 0/端口 1 上仍然是 2/时钟。)
我们一直在使用两台相同的 Skylake 服务器,软件完全相同,Centos 7 OS 和 BIOS 设置。一切都一样,除了延迟性能。我们的软件使用的是 AVX512。
在测试中,我注意到 AVX512 每次都会降低其中一个系统的性能(增加延迟)。存在显着的性能差异。我检查了一切,都是一样的。
我应该怎么做才能解决这个问题?哪个工具可以提供帮助?
提前致谢..
sudo lshw -class cpu
*-cpu:0
description: CPU
product: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
vendor: Intel Corp.
vendor_id: GenuineIntel
physical id: 400
bus info: cpu@0
version: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
slot: CPU1
size: 3GHz
capacity: 4GHz
width: 64 bits
clock: 1010MHz
capabilities: lm fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp x86-64 constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d
configuration: cores=18 enabledcores=18 threads=18
*-cpu:1 DISABLED
description: CPU [empty]
physical id: 401
slot: CPU2
更新: 在 Peter 的评论之后,我添加了以下示例代码作为示例。
#include <emmintrin.h>
#include <pthread.h>
#include <immintrin.h>
#include <unistd.h>
#include <inttypes.h>
#include <string.h>
#include <stdbool.h>
#include <stdio.h>
#define CACHE_LINE_SIZE 64
/**
* Copy 64 bytes from one location to another,
* locations should not overlap.
*/
static inline __attribute__((always_inline)) void
mov64(uint8_t *dst, const uint8_t *src)
{
__m512i zmm0;
zmm0 = _mm512_load_si512((const void *)src);
_mm512_store_si512((void *)dst, zmm0);
}
#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)
static inline uint64_t rdtsc(void)
{
union {
uint64_t tsc_64;
__extension__
struct {
uint32_t lo_32;
uint32_t hi_32;
};
} tsc;
__asm__ volatile("rdtsc" :
"=a" (tsc.lo_32),
"=d" (tsc.hi_32));
return tsc.tsc_64;
}
union levels {
__m512i zmm0;
struct {
uint32_t x1;
uint64_t x2;
uint64_t x3;
uint32_t x4;
uint32_t x5;
uint32_t x6;
uint32_t x7;
};
} __attribute__((aligned(CACHE_LINE_SIZE)));
union levels g_shared;
void *worker_loop(void *param)
{
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(16, &cpuset);
pthread_t thread = pthread_self();
pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
union levels lshared;
uint32_t old_x1 = 0;
lshared.x1 = 0;
while (1) {
__asm__ ("" ::: "memory");
lshared.zmm0 = _mm512_load_si512((const void *)&g_shared);
if (unlikely(lshared.x1 <= old_x1)) {
continue;
} else if (unlikely(lshared.x1 != lshared.x7)) {
// printf("%u %u %u %u %u %u\n", lshared.x1, lshared.x3, lshared.x4, lshared.x5, lshared.x6, lshared.x7);
exit(EXIT_FAILURE);
} else {
uint64_t val = rdtsc();
if (val > lshared.x2) {
printf("> (%u) %lu - %lu = %lu\n", lshared.x1, val, lshared.x2, val - lshared.x2);
} else {
printf("< (%u) %lu - %lu = %lu\n", lshared.x1, lshared.x2, val, lshared.x2 - val);
}
}
old_x1 = lshared.x1;
_mm_pause();
}
return NULL;
}
int main(int argc, char *argv[])
{
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(15, &cpuset);
pthread_t thread = pthread_self();
memset(&g_shared, 0, sizeof(g_shared));
pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
pthread_t worker;
pthread_create(&worker, NULL, worker_loop, NULL);
uint32_t val = 1;
union levels lshared;
while (1) {
lshared.x1 = val;
lshared.x2 = rdtsc();
lshared.x3 = val;
lshared.x4 = val;
lshared.x5 = val;
lshared.x6 = val;
lshared.x7 = val;
_mm512_store_si512((void *)&g_shared, lshared.zmm0);
__asm__ ("" ::: "memory");
usleep(100000);
val++;
_mm_pause();
}
return EXIT_SUCCESS;
}
较慢的系统输出:
> (1) 4582365777844442 - 4582365777792564 = 51878
> (2) 4582366077239290 - 4582366077238806 = 484
> (3) 4582366376674782 - 4582366376674346 = 436
> (4) 4582366676044526 - 4582366676041890 = 2636
> (5) 4582366975470562 - 4582366975470134 = 428
> (6) 4582367274899258 - 4582367274898828 = 430
> (7) 4582367574328446 - 4582367574328022 = 424
> (8) 4582367873757956 - 4582367873757532 = 424
> (9) 4582368173187886 - 4582368173187466 = 420
> (10) 4582368472618418 - 4582368472617958 = 460
> (11) 4582368772049720 - 4582368772049236 = 484
> (12) 4582369071481018 - 4582369071480594 = 424
> (13) 4582369370912760 - 4582369370912284 = 476
> (14) 4582369670344890 - 4582369670344212 = 678
> (15) 4582369969776826 - 4582369969776400 = 426
> (16) 4582370269209462 - 4582370269209024 = 438
> (17) 4582370568642626 - 4582370568642172 = 454
> (18) 4582370868076202 - 4582370868075764 = 438
> (19) 4582371167510016 - 4582371167509594 = 422
> (20) 4582371466944326 - 4582371466943892 = 434
> (21) 4582371766379206 - 4582371766378734 = 472
> (22) 4582372065814804 - 4582372065814344 = 460
> (23) 4582372365225608 - 4582372365223068 = 2540
> (24) 4582372664652112 - 4582372664651668 = 444
> (25) 4582372964080746 - 4582372964080314 = 432
> (26) 4582373263510732 - 4582373263510308 = 424
> (27) 4582373562940116 - 4582373562939676 = 440
> (28) 4582373862370284 - 4582373862369860 = 424
> (29) 4582374161800632 - 4582374161800182 = 450
更快的系统输出:
> (1) 9222001841102298 - 9222001841045386 = 56912
> (2) 9222002140513228 - 9222002140512908 = 320
> (3) 9222002439970702 - 9222002439970330 = 372
> (4) 9222002739428448 - 9222002739428114 = 334
> (5) 9222003038886492 - 9222003038886152 = 340
> (6) 9222003338344884 - 9222003338344516 = 368
> (7) 9222003637803702 - 9222003637803332 = 370
> (8) 9222003937262776 - 9222003937262404 = 372
> (9) 9222004236649320 - 9222004236648932 = 388
> (10) 9222004536101876 - 9222004536101510 = 366
> (11) 9222004835554776 - 9222004835554378 = 398
> (12) 9222005135008064 - 9222005135007686 = 378
> (13) 9222005434461868 - 9222005434461526 = 342
> (14) 9222005733916416 - 9222005733916026 = 390
> (15) 9222006033370968 - 9222006033370640 = 328
> (16) 9222006332825872 - 9222006332825484 = 388
> (17) 9222006632280956 - 9222006632280570 = 386
> (18) 9222006931736548 - 9222006931736178 = 370
> (19) 9222007231192376 - 9222007231191986 = 390
> (20) 9222007530648868 - 9222007530648486 = 382
> (21) 9222007830105642 - 9222007830105270 = 372
> (22) 9222008129562750 - 9222008129562382 = 368
> (23) 9222008429020310 - 9222008429019944 = 366
> (24) 9222008728478336 - 9222008728477970 = 366
> (25) 9222009027936696 - 9222009027936298 = 398
> (26) 9222009327395716 - 9222009327395342 = 374
> (27) 9222009626854876 - 9222009626854506 = 370
> (28) 9222009926282324 - 9222009926281936 = 388
> (29) 9222010225734832 - 9222010225734442 = 390
> (30) 9222010525187748 - 9222010525187366 = 382
更新2:在Peter的回答后,我添加了下面的示例代码作为示例来测量同一芯片上不同网状网络路径的延迟,答案的内容是正确的,不同的 cpu 具有不同的 cpu 间延迟。但在所有情况下,相同系统中的一个仍然比另一个慢 25%。
我也不知道会不会影响,不过我才发现慢的CPU有额外的md_clear标志。
总之,我应该怎么做才能解决这个问题?哪个工具可以提供帮助?我如何理解性能差异?
#include <emmintrin.h>
#include <pthread.h>
#include <immintrin.h>
#include <unistd.h>
#include <inttypes.h>
#include <string.h>
#include <stdbool.h>
#include <stdio.h>
#define CACHE_LINE_SIZE 64
/**
* Copy 64 bytes from one location to another,
* locations should not overlap.
*/
static inline __attribute__((always_inline)) void
mov64(uint8_t *dst, const uint8_t *src)
{
__m512i zmm0;
zmm0 = _mm512_load_si512((const void *)src);
_mm512_store_si512((void *)dst, zmm0);
}
#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)
static inline uint64_t rdtsc(void)
{
union {
uint64_t tsc_64;
__extension__
struct {
uint32_t lo_32;
uint32_t hi_32;
};
} tsc;
__asm__ volatile("rdtsc" :
"=a" (tsc.lo_32),
"=d" (tsc.hi_32));
return tsc.tsc_64;
}
union levels {
__m512i zmm0;
struct {
uint32_t x1;
uint64_t x2;
uint64_t x3;
uint32_t x4;
uint32_t x5;
uint32_t x6;
uint32_t x7;
};
} __attribute__((aligned(CACHE_LINE_SIZE)));
union levels g_shared;
uint32_t g_main_cpu;
uint32_t g_worker_cpu;
void *worker_loop(void *param)
{
_mm_mfence();
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(g_worker_cpu, &cpuset);
pthread_t thread = pthread_self();
pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
union levels lshared;
uint32_t old_x1 = 1;
uint64_t min = 10000, max = 0, sum = 0;
int i = 0;
while (i < 300) {
__asm__ ("" ::: "memory");
lshared.zmm0 = _mm512_load_si512((const void *)&g_shared);
if (unlikely(lshared.x1 <= old_x1)) {
continue;
} else if (unlikely(lshared.x1 != lshared.x7)) {
exit(EXIT_FAILURE);
} else {
uint64_t val = rdtsc();
uint64_t diff = val - lshared.x2;
sum += diff;
if (min > diff)
min = diff;
if (diff > max)
max = diff;
i++;
}
old_x1 = lshared.x1;
_mm_pause();
}
printf("(M=%u-W=%u) min=%lu max=%lu mean=%lu\n", g_main_cpu, g_worker_cpu, min, max, sum / 300);
return NULL;
}
int main(int argc, char *argv[])
{
for (int main_cpu = 2; main_cpu <= 17; ++main_cpu) {
for (int worker_cpu = 2; worker_cpu <= 17; ++worker_cpu) {
if (main_cpu == worker_cpu) {
continue;
}
_mm_mfence();
g_main_cpu = main_cpu;
g_worker_cpu = worker_cpu;
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(g_main_cpu, &cpuset);
pthread_t thread = pthread_self();
memset(&g_shared, 0, sizeof(g_shared));
pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
pthread_t worker;
pthread_create(&worker, NULL, worker_loop, NULL);
uint32_t val = 0;
union levels lshared;
for (int i = 0; i < 350; ++i) {
lshared.x1 = val;
lshared.x2 = rdtsc();
lshared.x3 = val;
lshared.x4 = val;
lshared.x5 = val;
lshared.x6 = val;
lshared.x7 = val;
_mm512_store_si512((void *)&g_shared, lshared.zmm0);
__asm__ ("" ::: "memory");
usleep(100000);
val++;
_mm_pause();
}
pthread_join(worker, NULL);
}
}
return EXIT_SUCCESS;
}
两个系统的输出:(2-17被隔离cpus)
slow cpu fast cpu
------------------------------------
(M=2-W=3) mean=580 mean=374
(M=2-W=4) mean=463 mean=365
(M=2-W=5) mean=449 mean=391
(M=2-W=6) mean=484 mean=345
(M=2-W=7) mean=430 mean=386
(M=2-W=8) mean=439 mean=369
(M=2-W=9) mean=445 mean=376
(M=2-W=10) mean=480 mean=354
(M=2-W=11) mean=440 mean=392
(M=2-W=12) mean=475 mean=324
(M=2-W=13) mean=453 mean=373
(M=2-W=14) mean=474 mean=344
(M=2-W=15) mean=445 mean=384
(M=2-W=16) mean=468 mean=372
(M=2-W=17) mean=462 mean=373
(M=3-W=2) mean=447 mean=392
(M=3-W=4) mean=556 mean=386
(M=3-W=5) mean=418 mean=409
(M=3-W=6) mean=473 mean=372
(M=3-W=7) mean=397 mean=400
(M=3-W=8) mean=408 mean=403
(M=3-W=9) mean=412 mean=413
(M=3-W=10) mean=447 mean=389
(M=3-W=11) mean=412 mean=423
(M=3-W=12) mean=446 mean=399
(M=3-W=13) mean=427 mean=407
(M=3-W=14) mean=445 mean=390
(M=3-W=15) mean=417 mean=448
(M=3-W=16) mean=438 mean=386
(M=3-W=17) mean=435 mean=396
(M=4-W=2) mean=463 mean=368
(M=4-W=3) mean=433 mean=401
(M=4-W=5) mean=561 mean=406
(M=4-W=6) mean=468 mean=378
(M=4-W=7) mean=416 mean=387
(M=4-W=8) mean=425 mean=386
(M=4-W=9) mean=425 mean=415
(M=4-W=10) mean=464 mean=379
(M=4-W=11) mean=424 mean=404
(M=4-W=12) mean=456 mean=369
(M=4-W=13) mean=441 mean=395
(M=4-W=14) mean=460 mean=378
(M=4-W=15) mean=427 mean=405
(M=4-W=16) mean=446 mean=369
(M=4-W=17) mean=448 mean=391
(M=5-W=2) mean=447 mean=382
(M=5-W=3) mean=418 mean=406
(M=5-W=4) mean=430 mean=397
(M=5-W=6) mean=584 mean=386
(M=5-W=7) mean=399 mean=399
(M=5-W=8) mean=404 mean=386
(M=5-W=9) mean=408 mean=408
(M=5-W=10) mean=446 mean=378
(M=5-W=11) mean=411 mean=407
(M=5-W=12) mean=440 mean=385
(M=5-W=13) mean=424 mean=402
(M=5-W=14) mean=442 mean=381
(M=5-W=15) mean=411 mean=411
(M=5-W=16) mean=433 mean=398
(M=5-W=17) mean=429 mean=395
(M=6-W=2) mean=486 mean=356
(M=6-W=3) mean=453 mean=388
(M=6-W=4) mean=471 mean=353
(M=6-W=5) mean=452 mean=388
(M=6-W=7) mean=570 mean=360
(M=6-W=8) mean=444 mean=377
(M=6-W=9) mean=450 mean=376
(M=6-W=10) mean=485 mean=335
(M=6-W=11) mean=451 mean=410
(M=6-W=12) mean=479 mean=353
(M=6-W=13) mean=463 mean=363
(M=6-W=14) mean=479 mean=359
(M=6-W=15) mean=450 mean=394
(M=6-W=16) mean=473 mean=364
(M=6-W=17) mean=469 mean=373
(M=7-W=2) mean=454 mean=365
(M=7-W=3) mean=418 mean=410
(M=7-W=4) mean=443 mean=370
(M=7-W=5) mean=421 mean=407
(M=7-W=6) mean=456 mean=363
(M=7-W=8) mean=527 mean=380
(M=7-W=9) mean=417 mean=392
(M=7-W=10) mean=460 mean=361
(M=7-W=11) mean=421 mean=402
(M=7-W=12) mean=447 mean=354
(M=7-W=13) mean=430 mean=381
(M=7-W=14) mean=449 mean=375
(M=7-W=15) mean=420 mean=393
(M=7-W=16) mean=442 mean=352
(M=7-W=17) mean=438 mean=367
(M=8-W=2) mean=463 mean=382
(M=8-W=3) mean=434 mean=411
(M=8-W=4) mean=452 mean=372
(M=8-W=5) mean=429 mean=402
(M=8-W=6) mean=469 mean=368
(M=8-W=7) mean=416 mean=418
(M=8-W=9) mean=560 mean=418
(M=8-W=10) mean=468 mean=385
(M=8-W=11) mean=429 mean=394
(M=8-W=12) mean=460 mean=378
(M=8-W=13) mean=439 mean=392
(M=8-W=14) mean=459 mean=373
(M=8-W=15) mean=429 mean=383
(M=8-W=16) mean=452 mean=376
(M=8-W=17) mean=449 mean=401
(M=9-W=2) mean=440 mean=368
(M=9-W=3) mean=410 mean=398
(M=9-W=4) mean=426 mean=385
(M=9-W=5) mean=406 mean=403
(M=9-W=6) mean=447 mean=378
(M=9-W=7) mean=393 mean=427
(M=9-W=8) mean=408 mean=368
(M=9-W=10) mean=580 mean=392
(M=9-W=11) mean=408 mean=387
(M=9-W=12) mean=433 mean=381
(M=9-W=13) mean=418 mean=444
(M=9-W=14) mean=441 mean=407
(M=9-W=15) mean=408 mean=401
(M=9-W=16) mean=427 mean=376
(M=9-W=17) mean=426 mean=383
(M=10-W=2) mean=478 mean=361
(M=10-W=3) mean=446 mean=379
(M=10-W=4) mean=461 mean=350
(M=10-W=5) mean=445 mean=373
(M=10-W=6) mean=483 mean=354
(M=10-W=7) mean=428 mean=370
(M=10-W=8) mean=436 mean=355
(M=10-W=9) mean=448 mean=390
(M=10-W=11) mean=569 mean=350
(M=10-W=12) mean=473 mean=337
(M=10-W=13) mean=454 mean=370
(M=10-W=14) mean=474 mean=360
(M=10-W=15) mean=441 mean=370
(M=10-W=16) mean=463 mean=354
(M=10-W=17) mean=462 mean=358
(M=11-W=2) mean=447 mean=384
(M=11-W=3) mean=411 mean=408
(M=11-W=4) mean=433 mean=394
(M=11-W=5) mean=413 mean=428
(M=11-W=6) mean=455 mean=383
(M=11-W=7) mean=402 mean=395
(M=11-W=8) mean=407 mean=418
(M=11-W=9) mean=417 mean=424
(M=11-W=10) mean=452 mean=395
(M=11-W=12) mean=577 mean=406
(M=11-W=13) mean=426 mean=402
(M=11-W=14) mean=442 mean=412
(M=11-W=15) mean=408 mean=411
(M=11-W=16) mean=435 mean=400
(M=11-W=17) mean=431 mean=415
(M=12-W=2) mean=473 mean=352
(M=12-W=3) mean=447 mean=381
(M=12-W=4) mean=461 mean=361
(M=12-W=5) mean=445 mean=366
(M=12-W=6) mean=483 mean=322
(M=12-W=7) mean=431 mean=358
(M=12-W=8) mean=438 mean=340
(M=12-W=9) mean=448 mean=409
(M=12-W=10) mean=481 mean=334
(M=12-W=11) mean=447 mean=351
(M=12-W=13) mean=580 mean=383
(M=12-W=14) mean=473 mean=359
(M=12-W=15) mean=441 mean=385
(M=12-W=16) mean=463 mean=355
(M=12-W=17) mean=462 mean=358
(M=13-W=2) mean=450 mean=385
(M=13-W=3) mean=420 mean=410
(M=13-W=4) mean=440 mean=396
(M=13-W=5) mean=418 mean=402
(M=13-W=6) mean=461 mean=385
(M=13-W=7) mean=406 mean=391
(M=13-W=8) mean=415 mean=382
(M=13-W=9) mean=421 mean=402
(M=13-W=10) mean=457 mean=376
(M=13-W=11) mean=422 mean=409
(M=13-W=12) mean=451 mean=381
(M=13-W=14) mean=579 mean=375
(M=13-W=15) mean=430 mean=402
(M=13-W=16) mean=440 mean=408
(M=13-W=17) mean=439 mean=394
(M=14-W=2) mean=477 mean=330
(M=14-W=3) mean=449 mean=406
(M=14-W=4) mean=464 mean=355
(M=14-W=5) mean=450 mean=389
(M=14-W=6) mean=487 mean=342
(M=14-W=7) mean=432 mean=380
(M=14-W=8) mean=439 mean=360
(M=14-W=9) mean=451 mean=405
(M=14-W=10) mean=485 mean=356
(M=14-W=11) mean=447 mean=398
(M=14-W=12) mean=479 mean=338
(M=14-W=13) mean=455 mean=382
(M=14-W=15) mean=564 mean=383
(M=14-W=16) mean=481 mean=361
(M=14-W=17) mean=465 mean=351
(M=15-W=2) mean=426 mean=409
(M=15-W=3) mean=395 mean=424
(M=15-W=4) mean=412 mean=427
(M=15-W=5) mean=395 mean=425
(M=15-W=6) mean=435 mean=391
(M=15-W=7) mean=379 mean=405
(M=15-W=8) mean=388 mean=412
(M=15-W=9) mean=399 mean=432
(M=15-W=10) mean=432 mean=389
(M=15-W=11) mean=397 mean=432
(M=15-W=12) mean=426 mean=393
(M=15-W=13) mean=404 mean=407
(M=15-W=14) mean=429 mean=412
(M=15-W=16) mean=539 mean=391
(M=15-W=17) mean=414 mean=397
(M=16-W=2) mean=456 mean=368
(M=16-W=3) mean=422 mean=406
(M=16-W=4) mean=445 mean=384
(M=16-W=5) mean=427 mean=397
(M=16-W=6) mean=462 mean=348
(M=16-W=7) mean=413 mean=408
(M=16-W=8) mean=419 mean=361
(M=16-W=9) mean=429 mean=385
(M=16-W=10) mean=463 mean=369
(M=16-W=11) mean=426 mean=404
(M=16-W=12) mean=454 mean=391
(M=16-W=13) mean=434 mean=378
(M=16-W=14) mean=454 mean=412
(M=16-W=15) mean=424 mean=416
(M=16-W=17) mean=578 mean=378
(M=17-W=2) mean=460 mean=402
(M=17-W=3) mean=419 mean=381
(M=17-W=4) mean=446 mean=394
(M=17-W=5) mean=424 mean=422
(M=17-W=6) mean=468 mean=369
(M=17-W=7) mean=409 mean=401
(M=17-W=8) mean=418 mean=405
(M=17-W=9) mean=428 mean=414
(M=17-W=10) mean=459 mean=369
(M=17-W=11) mean=424 mean=387
(M=17-W=12) mean=451 mean=372
(M=17-W=13) mean=435 mean=382
(M=17-W=14) mean=459 mean=369
(M=17-W=15) mean=426 mean=401
(M=17-W=16) mean=446 mean=371
我的猜测:不同的Xeon Gold 6154芯片(18c 36t)有不同的核心因缺陷而熔断,所以你有您固定到 and/or 您的缓存行最终映射到的 L3 缓存片的两个内核之间的 不同的网状网络路径 。这会影响这两个内核之间的内核间延迟。
根据 Wikichip,它 based on the "Extreme Core Count die" for SKX, which has 28 physical cores on it, the core count of the Xeon Platinum 8176 基于相同的芯片。
所以您的芯片上禁用了 10 个内核,但可能是不同的 10 个。这可能意味着某些内核彼此之间的跳数更多(也许)? And/or 这可能意味着内核以不同的顺序枚举,因此相同的硬编码内核编号意味着不同的网格位置。
https://en.wikichip.org/wiki/intel/mesh_interconnect_architecture
您的更新显示了来自所有核心对的新数据。似乎一个 CPU 对于大多数但不是所有的核心对来说都比较慢。 (尽管如果您使用平均值而不丢弃异常值,我并不完全相信该数据。)这仍然可以用不同的网格布局合理地解释,可能大多数核心之间的距离明显更差。
它是一个二维网格,大概反映了核心的物理布局。也许快速 CPU 大部分都在外部禁用了核心,因此活跃的核心相当密集地打包到一个较小的网格中。但也许较慢的那个在网格中有更多 "interior" 个核心的缺陷。
I just realized that the slow CPU has extra
md_clear
CPU feature flag.
根据 https://software.intel.com/security-software-guidance/insights/deep-dive-intel-analysis-microarchitectural-data-sampling,md_clear
标志表示微代码支持通过 verw
指令等进行 L1TF/微架构数据采样的变通方法
也许较新的微代码版本还有另一个更改会损害此微基准测试的性能(也可能是整体性能)。或者这可能是巧合。
来自更多 Xeon Gold CPUs 的更多数据以及较旧和较新的微码可能会有所启发。如果我们仍然看到 CPU 之间存在如此大的差异,即使使用相同的微代码,那将支持我的假设,即物理核心被融合为 28 核裸片作为 18 工作核心出售的结果 CPU.
还在基于较小裸片的 Xeon 上进行测试,例如启用所有 14 个内核的 14 核 HCC 裸片,可能会显示更好的最坏情况对内核间延迟。可能需要控制不同的 RDTSC vs turbo vs uncore 频率,除非网格时钟与参考核心时钟成比例。
该解释完全不依赖于 AVX512。您是否看到标量负载具有相同的效果?
此外,如果没有_mm_pause
,可能一个小的时间差异恰好对一个比另一个产生更糟糕的影响;也许一个核心正在看到管道核弹(machine_clears.memory_ordering
性能事件)而另一个不是?
您对 _mm_pause()
的更新主要排除了放大真实延迟的微小差异。不管是什么原因,差异确实有那么大。
您的 CPU 足够新,可以安全地假设 TSC 在核心之间同步,并且大概两者都已经 运行 达到最大涡轮增压。 (其中一个名为 CPU 的特性,constant_tsc
或 invariant_tsc
明确保证了这一点,但我忘了是哪一个。另一个意味着无论核心时钟频率如何,它都以固定的参考频率滴答.nonstop_tsc
表示内核休眠时不停止。)
(TL:DR:我认为您的微基准测试看起来很正常,并且您正在以合理的方式测量内核间延迟,没有巨大的测量误差。)
What should I do to solve this problem?
你不能。
如果低内核间延迟对于一个应用程序至关重要,请尝试几个不同的 CPUs,直到找到延迟低于平均水平的应用程序。
运行 Xeons 上延迟更差的其他应用程序。
或者,如果我的假设是正确的,也许可以得到一个基于高核心数芯片的 14 核 Xeon Gold。启用所有 14 个内核后,这应该是最好的情况。但是那些 Xeons 只有 1 个 AVX512 FMA 单元。
Which tool can help?
如果只有几个线程需要紧密耦合,请在您拥有的 CPU 上找到彼此延迟最低的物理内核集群。将对延迟最敏感的线程固定到这些内核。
如果这适用于您的应用程序,可以考虑基于 4 个物理内核的 CCX 单元的 Zen 或 Zen2 微体系结构,该集群内部具有低延迟,但跨集群的延迟要差得多. AMD 确实有一些多核芯片,但只有 Zen2 在其 load/store 和执行单元中具有完整的 256 位 SIMD 宽度。 (它仍然不支持 AVX512,但如果您的应用程序可以大量使用 SIMD,那么至少全速 AVX2+FMA 可能就是您想要的)。
How can I understand the performance difference?
如果我的假设是正确的,它是制造和销售的 CPU 的固有 属性。英特尔设计了具有 n
个物理内核的裸片。如果制造缺陷损坏了其中一些核心,他们仍然可以将其作为核心数较少的 SKU 出售。 (他们烧掉了物理保险丝,因此禁用的核心不会浪费电力)。大概它的网格节点仍然必须工作,除非他们可以短路整个节点以收紧网格?
当最高核心数 SKU 的产量高于他们想要销售的价格点的需求时,他们将禁用芯片上的一些工作核心和有缺陷的核心。但这通常是物理上的激光保险丝,而不是像旧 GPU 那样的固件,有时您可以破解固件以激活禁用的内核。所以你实际上无能为力。
购买芯片上所有核心都已启用的芯片(例如,"Extreme" 核心数 Xeons 有 28 个核心)意味着没有熔断核心。就内核间延迟的最坏情况对而言,这可能会为我们提供一些有趣的测试数据。
启用所有内核的较低内核数裸片也可能很有趣。https://en.wikichip.org/wiki/Category:microprocessor_models_by_intel_based_on_skylake_high_core_count_die page shows the "high" core count (HCC) SKX die has 14 cores (half the ECC die). The top model using that die is Xeon Gold 5120, a 14c/28t model. (With 1x 512-bit FMA unit per core, not 2). Intel Ark confirms。
如果 HCC 裸片每个内核只有 1 个 FMA 单元,我不会感到惊讶,这与 ECC 裸片包括额外的端口 5 512 位 FMA 单元不同。这将为英特尔销售的所有中档 SKU 节省芯片面积,并且拥有第二个 FMA 单元仅有助于 AVX512 代码。许多代码没有使用 AVX512。 (AVX2 和 AVX512 256 位 FMA 吞吐量在 CPUs 上的端口 0/端口 1 上仍然是 2/时钟。)