为什么 JVM 不在 Windows x86 上发出预取指令

Why doesn't the JVM emit prefetch instructions on Windows x86

如标​​题所述,为什么 OpenJDK JVM 不在 Windows x86 上发出预取指令?请参阅 OpenJDK Mercurial @ http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/c49dcaf78a65/src/os_cpu/windows_x86/vm/prefetch_windows_x86.inline.hpp

inline void Prefetch::read (void *loc, intx interval) {}
inline void Prefetch::write(void *loc, intx interval) {}

没有评论,除​​了源代码我没有找到其他资源。我问是因为它针对 Linux x86 这样做,请参阅 http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/c49dcaf78a65/src/os_cpu/linux_x86/vm/prefetch_linux_x86.inline.hpp

inline void Prefetch::read (void *loc, intx interval) {
#ifdef AMD64
  __asm__ ("prefetcht0 (%0,%1,1)" : : "r" (loc), "r" (interval));
#endif // AMD64
}

inline void Prefetch::write(void *loc, intx interval) {
#ifdef AMD64

  // Do not use the 3dnow prefetchw instruction.  It isn't supported on em64t.
  //  __asm__ ("prefetchw (%0,%1,1)" : : "r" (loc), "r" (interval));
  __asm__ ("prefetcht0 (%0,%1,1)" : : "r" (loc), "r" (interval));

#endif // AMD64
}

你引用的文件都有asm代码片段(inline assembler), which is used by some C/C++ software in its own code (as apangin, the JVM expert , mostly in GC code). And there is actually the difference: Linux, Solaris and BSD variants of x86_64 hotspot have prefetches in the hotspot and windows has them disabled/unimplemented which is partially strange, partially unexplainable why, and it may also make JVM bit (some percents; more on platforms without hardware prefetch) slower on Windows, but still will not help to sell more solaris/solaris paid support contracts for Sun/Oracle. that inline asm syntax may be not supported with MS C++ compiler, but _mm_prefetch should (Who will open JDK bug to add it to the file?)。

JVM 热点是 JIT,JITted 代码由 JIT 以字节形式发出(生成)(同时 JIT 可以将代码从其自己的函数复制到生成的代码中或发出对支持函数的调用、预取在热点中以字节形式发出)。我们怎样才能找到它是如何发出的?简单的在线方法是找到一些在线可搜索的 jdk8u 副本(或更好的 cross-reference like metager), for example on github: https://github.com/JetBrains/jdk8u_hotspot and do the search of prefetch or prefetch emit or prefetchr or lir_prefetchr。有一些相关结果:

JVM 中发出的实际字节数 c1 compiler / LIR in jdk8u_hotspot/src/cpu/x86/vm/assembler_x86.cpp:

void Assembler::prefetch_prefix(Address src) {
  prefix(src);
  emit_int8(0x0F);
}

void Assembler::prefetchnta(Address src) {
  NOT_LP64(assert(VM_Version::supports_sse(), "must support"));
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x18);
  emit_operand(rax, src); // 0, src
}

void Assembler::prefetchr(Address src) {
  assert(VM_Version::supports_3dnow_prefetch(), "must support");
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x0D);
  emit_operand(rax, src); // 0, src
}

void Assembler::prefetcht0(Address src) {
  NOT_LP64(assert(VM_Version::supports_sse(), "must support"));
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x18);
  emit_operand(rcx, src); // 1, src
}

void Assembler::prefetcht1(Address src) {
  NOT_LP64(assert(VM_Version::supports_sse(), "must support"));
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x18);
  emit_operand(rdx, src); // 2, src
}

void Assembler::prefetcht2(Address src) {
  NOT_LP64(assert(VM_Version::supports_sse(), "must support"));
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x18);
  emit_operand(rbx, src); // 3, src
}

void Assembler::prefetchw(Address src) {
  assert(VM_Version::supports_3dnow_prefetch(), "must support");
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x0D);
  emit_operand(rcx, src); // 1, src
}

c1 LIR 中的用法:src/share/vm/c1/c1_LIRAssembler.cpp

void LIR_Assembler::emit_op1(LIR_Op1* op) {
  switch (op->code()) { 
...
    case lir_prefetchr:
      prefetchr(op->in_opr());
      break;

    case lir_prefetchw:
      prefetchw(op->in_opr());
      break;

现在我们知道了the opcode lir_prefetchr and can search for it or in OpenGrok xref and lir_prefetchw, to find the only example in src/share/vm/c1/c1_LIR.cpp

void LIR_List::prefetch(LIR_Address* addr, bool is_store) {
  append(new LIR_Op1(
            is_store ? lir_prefetchw : lir_prefetchr,
            LIR_OprFact::address(addr)));
}

还有其他地方定义了预取指令(对于C2,如), the src/cpu/x86/vm/x86_64.ad:

// Prefetch instructions. ...
instruct prefetchr( memory mem ) %{
  predicate(ReadPrefetchInstr==3);
  match(PrefetchRead mem);
  ins_cost(125);

  format %{ "PREFETCHR $mem\t# Prefetch into level 1 cache" %}
  ins_encode %{
    __ prefetchr($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchrNTA( memory mem ) %{
  predicate(ReadPrefetchInstr==0);
  match(PrefetchRead mem);
  ins_cost(125);

  format %{ "PREFETCHNTA $mem\t# Prefetch into non-temporal cache for read" %}
  ins_encode %{
    __ prefetchnta($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchrT0( memory mem ) %{
  predicate(ReadPrefetchInstr==1);
  match(PrefetchRead mem);
  ins_cost(125);

  format %{ "PREFETCHT0 $mem\t# prefetch into L1 and L2 caches for read" %}
  ins_encode %{
    __ prefetcht0($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchrT2( memory mem ) %{
  predicate(ReadPrefetchInstr==2);
  match(PrefetchRead mem);
  ins_cost(125);

  format %{ "PREFETCHT2 $mem\t# prefetch into L2 caches for read" %}
  ins_encode %{
    __ prefetcht2($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchwNTA( memory mem ) %{
  match(PrefetchWrite mem);
  ins_cost(125);

  format %{ "PREFETCHNTA $mem\t# Prefetch to non-temporal cache for write" %}
  ins_encode %{
    __ prefetchnta($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

// Prefetch instructions for allocation.

instruct prefetchAlloc( memory mem ) %{
  predicate(AllocatePrefetchInstr==3);
  match(PrefetchAllocation mem);
  ins_cost(125);

  format %{ "PREFETCHW $mem\t# Prefetch allocation into level 1 cache and mark modified" %}
  ins_encode %{
    __ prefetchw($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchAllocNTA( memory mem ) %{
  predicate(AllocatePrefetchInstr==0);
  match(PrefetchAllocation mem);
  ins_cost(125);

  format %{ "PREFETCHNTA $mem\t# Prefetch allocation to non-temporal cache for write" %}
  ins_encode %{
    __ prefetchnta($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchAllocT0( memory mem ) %{
  predicate(AllocatePrefetchInstr==1);
  match(PrefetchAllocation mem);
  ins_cost(125);

  format %{ "PREFETCHT0 $mem\t# Prefetch allocation to level 1 and 2 caches for write" %}
  ins_encode %{
    __ prefetcht0($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchAllocT2( memory mem ) %{
  predicate(AllocatePrefetchInstr==2);
  match(PrefetchAllocation mem);
  ins_cost(125);

  format %{ "PREFETCHT2 $mem\t# Prefetch allocation to level 2 cache for write" %}
  ins_encode %{
    __ prefetcht2($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

JDK-4453409 所示,预取已在 JDK 1.4 中的 HotSpot JVM 中实现,以加速 GC。那是 15 多年前的事了,现在没有人会记得为什么它没有在 Windows 上实现。我的猜测是 Visual Studio(一直用于在 Windows 上构建 HotSpot)此时基本上不理解预取指令。看起来需要改进的地方。

不管怎样,你问的代码是JVM垃圾收集器内部使用的。这不是 JIT 生成的。 C2 JIT 代码生成器规则在架构定义文件 x86_64.ad, and there are rules 中,用于将 PrefetchReadPrefetchWritePrefetchAllocation 节点转换为相应的 x64 指令。

一个有趣的事实是 PrefetchReadPrefetchWrite 节点没有在代码中的任何地方创建。它们的存在只是为了支持 JDK 中的 Unsafe.prefetchX intrinsics, however, they are removed 9.

JIT生成预取指令的唯一情况是PrefetchAllocation节点。您可以使用 -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly 验证 PREFETCHNTA 确实是在对象分配后生成的, 在 Linux 和 Windows.[=25= 上]

class Test {
    public static void main(String[] args) {
        byte[] b = new byte[0];
        for (;;) {
            b = Arrays.copyOf(b, b.length + 1);
        }
    }
}

java.exe -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Test

# {method} {0x00000000176124e0} 'main' '([Ljava/lang/String;)V' in 'Test'
  ...
  0x000000000340e512: cmp    [=11=]x100000,%r11d
  0x000000000340e519: ja     0x000000000340e60f
  0x000000000340e51f: movslq 0x24(%rsp),%r10
  0x000000000340e524: add    [=11=]x1,%r10
  0x000000000340e528: add    [=11=]x17,%r10
  0x000000000340e52c: mov    %r10,%r8
  0x000000000340e52f: and    [=11=]xfffffffffffffff8,%r8
  0x000000000340e533: cmp    [=11=]x100000,%r11d
  0x000000000340e53a: ja     0x000000000340e496
  0x000000000340e540: mov    0x60(%r15),%rbp
  0x000000000340e544: mov    %rbp,%r9
  0x000000000340e547: add    %r8,%r9
  0x000000000340e54a: cmp    0x70(%r15),%r9
  0x000000000340e54e: jae    0x000000000340e496
  0x000000000340e554: mov    %r9,0x60(%r15)
  0x000000000340e558: prefetchnta 0xc0(%r9)
  0x000000000340e560: movq   [=11=]x1,0x0(%rbp)
  0x000000000340e568: prefetchnta 0x100(%r9)
  0x000000000340e570: movl   [=11=]x200000f5,0x8(%rbp)  ;   {metadata({type array byte})}
  0x000000000340e577: mov    %r11d,0xc(%rbp)
  0x000000000340e57b: prefetchnta 0x140(%r9)
  0x000000000340e583: prefetchnta 0x180(%r9)    ;*newarray
                                                ; - java.util.Arrays::copyOf@1 (line 3236)
                                                ; - Test::main@9 (line 9)