Why is FileInputStream read slower with a bigger array?

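If I read bytes from a file into a byte[], I see that FileInputStream performs worse when the array is around 1 MB than when it is 128 KB. On the 2 workstations I have tested, reading is almost twice as fast with 128 KB. Why is that?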

import java.io.*;

public class ReadFileInChuncks 
{
    public static void main(String[] args) throws IOException 
    {
        byte[] buffer1 = new byte[1024*128];
        byte[] buffer2 = new byte[1024*1024];

        String path = "some 1 gb big file";

        readFileInChuncks(path, buffer1, false);

        readFileInChuncks(path, buffer1, true);
        readFileInChuncks(path, buffer2, true);
        readFileInChuncks(path, buffer1, true);
        readFileInChuncks(path, buffer2, true);
    }

    public static void readFileInChuncks(String path, byte[] buffer, boolean report) throws IOException
    {
        long t = System.currentTimeMillis();

        // try-with-resources closes the stream so repeated runs do not leak file handles
        try (InputStream is = new FileInputStream(path)) {
            while ((readToArray(is, buffer)) != 0) {}
        }

        if (report)
            System.out.println((System.currentTimeMillis()-t) + " ms");
    }

    public static int readToArray(InputStream is, byte[] buffer) throws IOException
    {
        int index = 0;
        while (index != buffer.length)
        {
            int read = is.read(buffer, index, buffer.length - index);
            if (read == -1)
                break;
            index += read;
        }
        return index;
    }
}

Output:

422 ms 
717 ms 
422 ms 
718 ms

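Please note that this is a restatement of an already posted question. The other one was polluted with unrelated discussion. I will mark the other one for deletion.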

Edit: Duplicate, really? I could certainly write some better code to prove my point, but this does not answer my question.

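Edit 2: I ran the test with every buffer size from 5 KB to 1000 KB on Win7 / JRE 1.8.0_25. The bad performance starts at precisely 508 KB and holds for all larger sizes. Sorry for the bad diagram legend: x is the buffer size, y is milliseconds.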

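The optimal buffer size depends on the file system block size, the CPU cache size, and cache latency. Most OSes use a block size of 4096 or 8192, so it is recommended to use a buffer of that size or a multiple of it.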
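As a minimal sketch of that advice, the helper below rounds a requested buffer size up to a multiple of an assumed block size before reading. The class name, the method names, and the 4096 constant are illustrative assumptions; the real block size depends on the file system in use.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BlockAlignedRead {
    // Assumed block size; the answer mentions 4096 or 8192 as typical values.
    public static final int BLOCK_SIZE = 4096;

    // Rounds a positive requested size up to the next multiple of BLOCK_SIZE.
    public static int alignUp(int requested) {
        if (requested <= 0) {
            throw new IllegalArgumentException("buffer size must be positive");
        }
        return ((requested + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;
    }

    // Reads the whole file with a block-aligned buffer and returns the byte count.
    public static long countBytes(String path, int requestedBufferSize) throws IOException {
        byte[] buffer = new byte[alignUp(requestedBufferSize)];
        long total = 0;
        try (InputStream in = new FileInputStream(path)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the file to read; 100_000 is rounded up to 102_400 (25 blocks).
        System.out.println(countBytes(args[0], 100_000) + " bytes read");
    }
}
```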

This might be because of the CPU cache.

The CPU has its own cache memory of some fixed size. You can check the cache size by running this command in cmd:

wmic cpu get L2CacheSize

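Suppose the CPU cache size is 256k. If you read chunks of 256k or smaller, the content written to the buffer is still in the CPU cache when it is accessed afterwards. If the chunks are larger than 256k, only the last 256k that were read are in the CPU cache, so when access starts again from the beginning of the buffer, the content must be fetched from main memory.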

I rewrote the test to try different buffer sizes.

Here is the new code:

import java.io.*;

public class ReadFileInChunks {

    public static void main(String[] args) throws IOException {
        // Both backslashes must be doubled: "\1" would be parsed as an octal escape.
        String path = "C:\\tmp\\1GB.zip";
        readFileInChuncks(path, new byte[1024 * 128], false);

        for (int i = 1; i <= 1024; i+=10) {
            readFileInChuncks(path, new byte[1024 * i], true);
        }
    }

    public static void readFileInChuncks(String path, byte[] buffer, boolean report) throws IOException {
        long t = System.currentTimeMillis();

        // try-with-resources closes the stream so the ~100 runs do not leak file handles
        try (InputStream is = new FileInputStream(path)) {
            while ((readToArray(is, buffer)) != 0) {
            }
        }

        if (report) {
            System.out.println("buffer size = " + buffer.length/1024 + "kB , duration = " + (System.currentTimeMillis() - t) + " ms");
        }
    }

    public static int readToArray(InputStream is, byte[] buffer) throws IOException {
        int index = 0;
        while (index != buffer.length) {
            int read = is.read(buffer, index, buffer.length - index);
            if (read == -1) {
                break;
            }
            index += read;
        }
        return index;
    }

}

Here are the results...

buffer size = 121kB , duration = 320 ms
buffer size = 131kB , duration = 330 ms
buffer size = 141kB , duration = 330 ms
buffer size = 151kB , duration = 323 ms
buffer size = 161kB , duration = 320 ms
buffer size = 171kB , duration = 320 ms
buffer size = 181kB , duration = 320 ms
buffer size = 191kB , duration = 310 ms
buffer size = 201kB , duration = 320 ms
buffer size = 211kB , duration = 310 ms
buffer size = 221kB , duration = 310 ms
buffer size = 231kB , duration = 310 ms
buffer size = 241kB , duration = 310 ms
buffer size = 251kB , duration = 310 ms
buffer size = 261kB , duration = 320 ms
buffer size = 271kB , duration = 310 ms
buffer size = 281kB , duration = 320 ms
buffer size = 291kB , duration = 310 ms
buffer size = 301kB , duration = 319 ms
buffer size = 311kB , duration = 320 ms
buffer size = 321kB , duration = 310 ms
buffer size = 331kB , duration = 320 ms
buffer size = 341kB , duration = 310 ms
buffer size = 351kB , duration = 320 ms
buffer size = 361kB , duration = 310 ms
buffer size = 371kB , duration = 320 ms
buffer size = 381kB , duration = 311 ms
buffer size = 391kB , duration = 310 ms
buffer size = 401kB , duration = 310 ms
buffer size = 411kB , duration = 320 ms
buffer size = 421kB , duration = 310 ms
buffer size = 431kB , duration = 310 ms
buffer size = 441kB , duration = 310 ms
buffer size = 451kB , duration = 320 ms
buffer size = 461kB , duration = 310 ms
buffer size = 471kB , duration = 310 ms
buffer size = 481kB , duration = 310 ms
buffer size = 491kB , duration = 310 ms
buffer size = 501kB , duration = 310 ms
buffer size = 511kB , duration = 320 ms
buffer size = 521kB , duration = 300 ms
buffer size = 531kB , duration = 310 ms
buffer size = 541kB , duration = 312 ms
buffer size = 551kB , duration = 311 ms
buffer size = 561kB , duration = 320 ms
buffer size = 571kB , duration = 310 ms
buffer size = 581kB , duration = 314 ms
buffer size = 591kB , duration = 320 ms
buffer size = 601kB , duration = 310 ms
buffer size = 611kB , duration = 310 ms
buffer size = 621kB , duration = 310 ms
buffer size = 631kB , duration = 310 ms
buffer size = 641kB , duration = 310 ms
buffer size = 651kB , duration = 310 ms
buffer size = 661kB , duration = 301 ms
buffer size = 671kB , duration = 310 ms
buffer size = 681kB , duration = 310 ms
buffer size = 691kB , duration = 310 ms
buffer size = 701kB , duration = 310 ms
buffer size = 711kB , duration = 300 ms
buffer size = 721kB , duration = 310 ms
buffer size = 731kB , duration = 310 ms
buffer size = 741kB , duration = 310 ms
buffer size = 751kB , duration = 310 ms
buffer size = 761kB , duration = 311 ms
buffer size = 771kB , duration = 310 ms
buffer size = 781kB , duration = 300 ms
buffer size = 791kB , duration = 300 ms
buffer size = 801kB , duration = 310 ms
buffer size = 811kB , duration = 310 ms
buffer size = 821kB , duration = 300 ms
buffer size = 831kB , duration = 310 ms
buffer size = 841kB , duration = 310 ms
buffer size = 851kB , duration = 300 ms
buffer size = 861kB , duration = 310 ms
buffer size = 871kB , duration = 310 ms
buffer size = 881kB , duration = 310 ms
buffer size = 891kB , duration = 304 ms
buffer size = 901kB , duration = 310 ms
buffer size = 911kB , duration = 310 ms
buffer size = 921kB , duration = 310 ms
buffer size = 931kB , duration = 299 ms
buffer size = 941kB , duration = 321 ms
buffer size = 951kB , duration = 310 ms
buffer size = 961kB , duration = 310 ms
buffer size = 971kB , duration = 310 ms
buffer size = 981kB , duration = 310 ms
buffer size = 991kB , duration = 295 ms
buffer size = 1001kB , duration = 339 ms
buffer size = 1011kB , duration = 302 ms
buffer size = 1021kB , duration = 610 ms

It looks like some sort of threshold is hit at a buffer size of around 1021kB. Looking deeper into this, I see...

buffer size = 1017kB , duration = 310 ms
buffer size = 1018kB , duration = 310 ms
buffer size = 1019kB , duration = 602 ms
buffer size = 1020kB , duration = 600 ms

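So it appears there is some sort of doubling effect when this threshold is hit. My initial thought was that the readToArray while loop was looping twice as many times once the threshold was reached, but that is not the case: the while loop goes through only one iteration, whether it is the 300 ms run or the 600 ms run. So let's look at io_util.c, which actually implements reading the data from disk, for some clues.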
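One way to check the iteration count is to wrap the stream in a small counting decorator. The sketch below is illustrative only: ReadCallCounter is a hypothetical class, not part of the original test, and readToArray is repeated here so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadCallCounter extends FilterInputStream {
    private long readCalls = 0;

    public ReadCallCounter(InputStream in) {
        super(in);
    }

    // Counts every bulk read; read(byte[]) also lands here via FilterInputStream.
    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        readCalls++;
        return super.read(b, off, len);
    }

    public long getReadCalls() {
        return readCalls;
    }

    // Same loop as readToArray in the test program.
    public static int readToArray(InputStream is, byte[] buffer) throws IOException {
        int index = 0;
        while (index != buffer.length) {
            int read = is.read(buffer, index, buffer.length - index);
            if (read == -1) {
                break;
            }
            index += read;
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for the file here.
        ReadCallCounter in = new ReadCallCounter(new ByteArrayInputStream(new byte[1000]));
        int filled = readToArray(in, new byte[1000]);
        System.out.println(filled + " bytes in " + in.getReadCalls() + " read call(s)");
    }
}
```

Wrapping the FileInputStream from the test in the same way would report how many read calls each buffer fill needs on a given machine.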

jint
readBytes(JNIEnv *env, jobject this, jbyteArray bytes,
          jint off, jint len, jfieldID fid)
{
    jint nread;
    char stackBuf[BUF_SIZE];
    char *buf = NULL;
    FD fd;

    if (IS_NULL(bytes)) {
        JNU_ThrowNullPointerException(env, NULL);
        return -1;
    }

    if (outOfBounds(env, off, len, bytes)) {
        JNU_ThrowByName(env, "java/lang/IndexOutOfBoundsException", NULL);
        return -1;
    }

    if (len == 0) {
        return 0;
    } else if (len > BUF_SIZE) {
        buf = malloc(len);
        if (buf == NULL) {
            JNU_ThrowOutOfMemoryError(env, NULL);
            return 0;
        }
    } else {
        buf = stackBuf;
    }

    fd = GET_FD(this, fid);
    if (fd == -1) {
        JNU_ThrowIOException(env, "Stream Closed");
        nread = -1;
    } else {
        nread = (jint)IO_Read(fd, buf, len);
        if (nread > 0) {
            (*env)->SetByteArrayRegion(env, bytes, off, nread, (jbyte *)buf);
        } else if (nread == JVM_IO_ERR) {
            JNU_ThrowIOExceptionWithLastError(env, "Read error");
        } else if (nread == JVM_IO_INTR) {
            JNU_ThrowByName(env, "java/io/InterruptedIOException", NULL);
        } else { /* EOF */
            nread = -1;
        }
    }

    if (buf != stackBuf) {
        free(buf);
    }
    return nread;
}

One thing to note is that BUF_SIZE is set to 8192. The doubling effect happens way above that, so the next culprit would be the IO_Read method.

windows/native/java/io/io_util_md.h:#define IO_Read handleRead

So we go to the handleRead method.

windows/native/java/io/io_util_md.c:handleRead(jlong fd, void *buf, jint len)

This method hands the request off to a function called ReadFile.

JNIEXPORT
size_t
handleRead(jlong fd, void *buf, jint len)
{
    DWORD read = 0;
    BOOL result = 0;
    HANDLE h = (HANDLE)fd;
    if (h == INVALID_HANDLE_VALUE) {
        return -1;
    }
    result = ReadFile(h,          /* File handle to read */
                      buf,        /* address to put data */
                      len,        /* number of bytes to read */
                      &read,      /* number of bytes read */
                      NULL);      /* no overlapped struct */
    if (result == 0) {
        int error = GetLastError();
        if (error == ERROR_BROKEN_PIPE) {
            return 0; /* EOF */
        }
        return -1;
    }
    return read;
}

And this is where the trail runs cold... for now. If I find the code for ReadFile, I'll take a look and post back.

TL;DR The performance drop is caused by memory allocation, not by file reading issues.

This is a typical benchmarking problem: you benchmark one thing but actually measure another.

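First of all, when I rewrote the sample code using RandomAccessFile, FileChannel and ByteBuffer.allocateDirect, the threshold disappeared. File reading performance became roughly the same for 128K and 1M buffers.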
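The rewritten code is not shown in the answer, so the following is only a sketch of what such a version might look like. The class and method names are assumptions; the RandomAccessFile, FileChannel and ByteBuffer.allocateDirect calls are the standard JDK APIs named above.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class DirectBufferRead {
    // Reads the whole file through a direct buffer and returns the total byte count.
    public static long readAll(String path, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("buffer size must be positive");
        }
        ByteBuffer buffer = ByteBuffer.allocateDirect(bufferSize);
        long total = 0;
        try (RandomAccessFile file = new RandomAccessFile(path, "r");
             FileChannel channel = file.getChannel()) {
            while (true) {
                buffer.clear();               // reset position and limit for the next read
                int n = channel.read(buffer); // the channel fills the direct buffer itself
                if (n == -1) {
                    break;                    // end of file
                }
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the file to read; the two sizes mirror the question.
        for (int size : new int[] {128 * 1024, 1024 * 1024}) {
            long start = System.nanoTime();
            long bytes = readAll(args[0], size);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(size / 1024 + "K buffer: " + bytes + " bytes in " + ms + " ms");
        }
    }
}
```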

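Unlike direct ByteBuffer I/O, FileInputStream.read cannot load data directly into a Java byte array. It first has to get the data into some native buffer, and then copy it to the Java array using the JNI SetByteArrayRegion function.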

So we have to look at the native implementation of FileInputStream.read. It comes down to the following piece of code in io_util.c:

    if (len == 0) {
        return 0;
    } else if (len > BUF_SIZE) {
        buf = malloc(len);
        if (buf == NULL) {
            JNU_ThrowOutOfMemoryError(env, NULL);
            return 0;
        }
    } else {
        buf = stackBuf;
    }

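Here BUF_SIZE == 8192. If the requested length is larger than this reserved stack area, a temporary buffer is allocated with malloc. On Windows, malloc is usually implemented via the HeapAlloc WINAPI call.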

Next, I measured the performance of HeapAlloc + HeapFree calls alone, without file I/O. The results were interesting:

     128K:    5 μs
     256K:   10 μs
     384K:   15 μs
     512K:   20 μs
     640K:   25 μs
     768K:   29 μs
     896K:   33 μs
    1024K:  316 μs  <-- almost 10x leap
    1152K:  356 μs
    1280K:  399 μs
    1408K:  436 μs
    1536K:  474 μs
    1664K:  511 μs
    1792K:  553 μs
    1920K:  592 μs
    2048K:  628 μs

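As you can see, the performance of OS memory allocation changes drastically at the 1MB boundary. This can be explained by different allocation algorithms being used for small chunks and for large chunks.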
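A rough back-of-the-envelope check, assuming the test file is exactly 1 GiB and that each read call performs one allocate/free pair at the cost measured above: a 1024K buffer needs 1024 reads at 316 μs each, about 324 ms in total, while a 128K buffer needs 8192 reads at 5 μs each, about 41 ms. That is the same order of magnitude as the roughly 300 ms gap reported earlier in the thread, although those timings come from different machines. The class below is a hypothetical helper that only does this arithmetic.

```java
public class AllocationOverheadEstimate {
    // Estimated total allocation time in milliseconds, assuming one
    // allocate/free pair per read call and a fixed cost per pair.
    public static double overheadMillis(long fileBytes, long bufferBytes, double allocMicros) {
        long reads = (fileBytes + bufferBytes - 1) / bufferBytes; // ceiling division
        return reads * allocMicros / 1000.0;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30; // assumption: the test file is exactly 1 GiB
        // Per-call costs are taken from the HeapAlloc + HeapFree table above.
        System.out.println("128K buffer:  " + overheadMillis(oneGiB, 128 * 1024, 5) + " ms");
        System.out.println("1024K buffer: " + overheadMillis(oneGiB, 1024 * 1024, 316) + " ms");
    }
}
```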

UPDATE

The documentation for HeapCreate confirms the idea of a specific allocation strategy for blocks larger than 1MB (see the dwMaximumSize description).

Also, the largest memory block that can be allocated from the heap is slightly less than 512 KB for a 32-bit process and slightly less than 1,024 KB for a 64-bit process.

...

Requests to allocate memory blocks larger than the limit for a fixed-size heap do not automatically fail; instead, the system calls the VirtualAlloc function to obtain the memory that is needed for large blocks.
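One practical consequence of this analysis, offered as a sketch rather than a measured fix: if each read call asks for at most 8192 bytes (the BUF_SIZE quoted from io_util.c), the native code should stay on the stack-buffer path and skip malloc, even when the Java array itself is large. The class and method names below are hypothetical. Whether the extra JNI calls cost less than the large allocations would need to be measured on the target JVM, and other JDK versions may implement read differently.

```java
import java.io.IOException;
import java.io.InputStream;

public class SmallChunkFill {
    // Matches BUF_SIZE in the native code quoted above (an assumption for other JDK versions).
    public static final int MAX_CHUNK = 8192;

    // Fills the buffer like readToArray, but never requests more than MAX_CHUNK bytes per call.
    public static int fill(InputStream in, byte[] buffer) throws IOException {
        int index = 0;
        while (index < buffer.length) {
            int len = Math.min(MAX_CHUNK, buffer.length - index);
            int read = in.read(buffer, index, len);
            if (read == -1) {
                break; // end of stream
            }
            index += read;
        }
        return index;
    }
}
```

It can be dropped into the test program in place of readToArray, for example `SmallChunkFill.fill(is, buffer)`, so the two variants can be timed side by side.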